Abstract 

"Classical" First Order (FO) algorithms of convex optimization, such as Mirror Descent 
algorithm or Nesterov's optimal algorithm of smooth convex optimization, are well known 
to have optimal (theoretical) complexity estimates which do not depend on the problem 
dimension. However, to attain the optimality, the domain of the problem should admit a 
"good proximal setup". The latter essentially means that 1) the problem domain should 
satisfy certain geometric conditions which we refer to as "favorable geometry", and 2) the 
practical use of these methods is conditioned by our ability to solve efficiently an auxiliary 
optimization task - computing proximal transformation - at each iteration of the method. 
More often than not these two conditions are satisfied in optimization problems arising in 
computational learning, which explains why FO methods of proximal type have recently 
become the methods of choice for solving various learning problems. Yet, they meet their 
limits in several important problems such as multi-task learning with a large number of tasks, 
where the problem domain does not exhibit favorable geometry, and learning and matrix 
completion problems with a nuclear norm constraint, where the numerical cost of solving the 
auxiliary problem becomes prohibitive at large scale. 

We propose a novel approach to solving nonsmooth optimization problems arising in 
learning applications where a Fenchel-type representation of the objective function is available. 
The approach is based on applying FO algorithms to the dual problem and using the accuracy 
certificates supplied by the method to recover the primal solution. While suboptimal in terms 
of accuracy guarantees, the proposed approach does not rely upon a "good proximal setup" for 
the primal problem but requires the problem domain to admit a Linear Optimization oracle 
- the ability to efficiently maximize a linear form on the domain of the primal problem. 



1 Introduction 

Motivation and background. The problem of interest in this paper is a convex optimization 
problem in the form 

$$\mathrm{Opt}(P) = \max_{x \in X} f_*(x) \tag{P}$$

where $X$ is a nonempty closed and bounded subset of Euclidean space $E_x$, and $f_*$ is a concave 
and Lipschitz continuous function on $X$. We are interested in the situation where the sizes 
of the problem put it beyond the "practical grasp" of polynomial time interior point methods, 
whose iterations are rather computationally expensive in the large scale case. In this case the 
methods of choice are the First Order (FO) optimization techniques. The state of the art of 
these techniques can be briefly summarized as follows: 

• The most standard FO approach to (P) requires to provide $X$, $E_x$ with a proximal setup $(\|\cdot\|, \omega(\cdot))$, 
that is, to equip the space $E_x$ with a norm $\|\cdot\|$, and the domain $X$ of the problem with a 
distance-generating function (d.-g.f.) $\omega(\cdot): X \to \mathbf{R}$ which should be convex and continuous on 
$X$, admit a continuous in $x \in X^o = \{x \in X : \partial\omega(x) \neq \emptyset\}$ selection $\omega'(x)$ of subgradients, and 
be strongly convex, modulus 1, w.r.t. $\|\cdot\|$: 

$$\langle \omega'(x) - \omega'(x'), x - x' \rangle \geq \|x - x'\|^2. \tag{1}$$

After such a setup is fixed, generating an $\epsilon$-solution to the problem (i.e., a point $x_\epsilon \in X$ 
satisfying $\mathrm{Opt}(P) - f_*(x_\epsilon) \leq \epsilon$) costs at most $N(\epsilon)$ steps, where 

• $N(\epsilon) = O(1)\frac{\Omega_X^2 L^2}{\epsilon^2}$ in the nonsmooth case, where $f_*$ is Lipschitz continuous, with constant 
$L$ w.r.t. $\|\cdot\|$ (Mirror Descent (MD) algorithm, see, e.g., [8, Chapter 5]), and 

• $N(\epsilon) = O(1)\Omega_X\sqrt{D/\epsilon}$ in the smooth case, where $f_*$ possesses Lipschitz continuous, with constant 
$D$, gradient: $\|f_*'(x) - f_*'(x')\|_* \leq D\|x - x'\|$, where $\|\cdot\|_*$ is the norm conjugate to 
$\|\cdot\|$ (Nesterov's optimal algorithm for smooth convex optimization, see, e.g., [11]). 



In the above bounds, $\Omega_X = \Omega[X, \omega(\cdot)] = \sqrt{2[\max_{x \in X} \omega(x) - \min_{x \in X} \omega(x)]}$ is the $\omega$-diameter of 
$X$; here and in the sequel $O(1)$'s stand for positive absolute constants. 

A step of a FO method essentially reduces to a single computation of $f_*$, $f_*'$ at a point and 
a single computation of the prox-mapping 

$$\mathrm{Prox}_x(\xi) := \operatorname{argmin}_{x' \in X} \left[ \langle \xi - \omega'(x), x' \rangle + \omega(x') \right]$$

for a pair $x \in X^o$, $\xi \in E_x$. 

• A different way of processing (P) by FO methods, originating in the breakthrough paper of 
Nesterov [11], is to use a Fenchel-type representation of $f_*$: 

$$f_*(x) = \min_{y \in Y} \left[ F(x,y) := \langle x, Ay + a \rangle + \psi(y) \right], \tag{2}$$

where $Y$ is a closed and bounded subset of Euclidean space $E_y$ and $\psi(y)$ is a convex function. 
Representations of this type are readily available for a wide family of "well-structured" nonsmooth 
objectives $f_*$; moreover, usually we can make $\psi$ possess Lipschitz continuous gradient 
or even be linear (for instructive examples, see, e.g., [11] or [8, Chapter 6]). Whenever this is 
the case, and given proximal setups $(\|\cdot\|_x, \omega_x(\cdot))$ and $(\|\cdot\|_y, \omega_y(\cdot))$ for $(E_x, X)$ and for $(E_y, Y)$, 
we can find an $\epsilon$-solution to (P) in 

$$N(\epsilon) \leq O(1)\frac{L_{xx}\Omega_X^2 + L_{xy}\Omega_X\Omega_Y + L_{yy}\Omega_Y^2}{\epsilon}$$

steps (Nesterov's smoothing [11] or the Mirror Prox algorithm, see, e.g., [8, Chapter 6]). Here 
$\Omega_X = \Omega[X, \omega_x(\cdot)]$, $\Omega_Y = \Omega[Y, \omega_y(\cdot)]$, and $L_{xx}$, $L_{yy}$, $L_{xy}$ are the partial Lipschitz constants of 
$\nabla F(x,y)$, namely, 

$$\forall (x, x' \in X,\ y, y' \in Y):$$
$$\|\nabla_x F(x',y') - \nabla_x F(x,y)\|_{x,*} \leq L_{xx}\|x' - x\|_x + L_{xy}\|y' - y\|_y,$$
$$\|\nabla_y F(x',y') - \nabla_y F(x,y)\|_{y,*} \leq L_{xy}\|x' - x\|_x + L_{yy}\|y' - y\|_y,$$

and $\|\cdot\|_{x,*}$, $\|\cdot\|_{y,*}$ are the norms conjugate to $\|\cdot\|_x$, $\|\cdot\|_y$, respectively. A step of the method 
requires a single computation of $\nabla F(\cdot)$ at a point and computing the values of $O(1)$ prox-mappings 
associated with $(X, \omega_x(\cdot))$ and $(Y, \omega_y(\cdot))$. 

Clearly, to be practical, methods of the outlined type should rely on "good" proximal setups 
- those resulting in "moderate" values of $\Omega_X$ and $\Omega_Y$ and not too difficult to compute prox-mappings 
associated with $\omega_x$ and $\omega_y$. This is indeed the case for domains $X$ arising in numerous 
applications (for instructive examples, see, e.g., [8, Chapter 5]). The question addressed in 
this paper is what to do when one of the domains, namely, $X$, does not admit a "good" proximal 
setup. Here are two instructive examples: 

A. $X$ is the unit ball of the nuclear norm $\|\sigma(\cdot)\|_1$ in the space $\mathbf{R}^{p \times q}$ of $p \times q$ matrices (from 
now on, for a $p \times q$ matrix $x$, $\sigma(x) = [\sigma_1(x); ...; \sigma_{\min[p,q]}(x)]$ denotes the vector comprised of the 
singular values of $x$ taken in the non-ascending order). This domain arises in various low-rank-oriented 
problems of matrix recovery. In this case, $X$ does admit a proximal setup with $\Omega_X = 
O(1)\sqrt{\ln(pq)}$. However, computing the prox-mapping involves a full singular value decomposition of 
a $p \times q$ matrix and becomes prohibitively time-consuming when $p$, $q$ are in the range of tens 
of thousands. Note that this is hardly a shortcoming of the existing proximal setups, since 
already computing the nuclear norm (that is, checking the inclusion $x \in X$) requires an SVD 
of $x$. 

B. $X$ is a high-dimensional box - the unit ball of the $\|\cdot\|_\infty$-norm of $\mathbf{R}^m$ with large $m$, or, more 
generally, the unit ball of the norm $\|x\|_{\infty|2} = \max_{1 \leq j \leq m} \|x^j\|_2$, where $x = [x^1; ...; x^m] \in 
E = \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_m}$. Here it is easy to point out a proximal setup with an easy-to-compute 
prox-mapping (e.g., the Euclidean setup $\|\cdot\| = \|\cdot\|_2$, $\omega(x) = \frac{1}{2}\langle x, x \rangle$). However, it is easily seen 
that whenever $\|\cdot\|$ is a norm satisfying $\|e_j\| \geq 1$ for all basic orths $e_j$ (see the footnote on this normalization), one has $\Omega_X \geq O(1)\sqrt{m}$, 
that is, the (theoretical) performance of "natural" proximal FO methods deteriorates rapidly as 
$m$ grows. 

Note that whenever a prox-mapping associated with $X$ is "easy to compute," it is equally easy 
to maximize over $X$ a linear form (since $\mathrm{Prox}_x(-t\xi)$ converges to the maximizer of $\langle \xi, x \rangle$ over $X$ 
as $t \to \infty$). In such a case, we have at our disposal an efficient Linear Optimization (LO) oracle - 
a routine which, given on input a linear form $\xi$, returns a point $x_X(\xi) \in \mathrm{Argmax}_{x \in X} \langle \xi, x \rangle$. This 
conclusion, however, cannot be reversed - our ability to maximize, at a reasonable cost, linear 
functionals over $X$ does not imply the possibility to compute a prox-mapping at a comparable 
cost. For example, when $X \subset \mathbf{R}^{p \times q}$ is the unit ball of the nuclear norm, maximizing a linear 
function $\langle \xi, x \rangle = \mathrm{Tr}(\xi x^T)$ over $X$ requires finding the largest singular value of a $p \times q$ matrix 
and the associated left singular vector. For large $p$ and $q$, solving the latter problem is by orders 
of magnitude cheaper than computing the full SVD of a $p \times q$ matrix. This and similar examples 
motivate the interest, especially in the Machine Learning community, in optimization techniques 
solving (P) via an LO oracle for $X$. In particular, the only "classical" technique of this type 
- the Conditional Gradient (CG) algorithm going back to Frank and Wolfe [7] - has attracted 
much attention recently. In the setting of the CG method it is assumed that $f_*$ is smooth (with 
Lipschitz/Hölder continuous gradient), and the standard result here (which is widely known, 
see, e.g., [HEKTS]) is the following. 



^This is a natural normalization: indeed, $\|e_j\| \ll 1$ means that the $j$-th coordinate of $x$ has a large Lipschitz 
constant w.r.t. $\|\cdot\|$, in spite of the fact that this coordinate is "perfectly well behaved" on $X$ - its variation on 
the set is just 2. 
Proposition 1.1 Let $X$ be a closed and bounded convex set in a Euclidean space $E_x$ such that 
$X - X$ linearly spans $E_x$. Assume that we are given a point $x_1 \in X$ and an LO oracle for $X$, 
and let $f_*$ be a concave continuously differentiable function on $X$ such that for some $C < \infty$ and 
$q \in (1, 2]$ one has 

$$\forall x, x' \in X: \quad f_*(x') \geq f_*(x) + \langle f_*'(x), x' - x \rangle - \frac{1}{q} C \|x' - x\|_X^q, \tag{3}$$

where $\|\cdot\|_X$ is the norm on $E_x$ with the unit ball $X - X$. Consider the recurrences 
(a) $x_{t+1} \in \mathrm{Argmax}_{x \in \Delta_t} f_*(x)$, $\Delta_t = [x_t, x_X(f_*'(x_t))]$, 
(b) $x_{t+1} = x_t + \gamma_t [x_X(f_*'(x_t)) - x_t]$, with stepsizes $\gamma_t$ [value not recoverable from this copy], 
where $x_X(\xi) \in \mathrm{Argmax}_{x \in X} \langle \xi, x \rangle$ and $x_1 \in X$. Then for all $t = 2, 3, ...$ one has 

[equation (4): a bound on $\max_{x \in X} f_*(x) - f_*(x_t)$ with separate right-hand sides for recurrence (a), in terms of $t + q - 2$, and for recurrence (b), in terms of $t + 1$; not recoverable from this copy] 
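
To make recurrence (b) concrete, here is a minimal Python sketch of the Conditional Gradient method driven purely by an LO oracle. Everything specific in it is an illustrative assumption rather than something taken from the text: $X$ is chosen as the unit $\ell_1$ ball (whose LO oracle returns a signed coordinate vertex), the objective is the hypothetical smooth concave function $f_*(x) = -\frac{1}{2}\|x - c\|_2^2$, and the stepsize is the classical choice $\gamma_t = 2/(t+1)$.

```python
def lo_oracle_l1_ball(xi):
    """LO oracle for the unit l1 ball: a maximizer of <xi, x> is a signed vertex."""
    j = max(range(len(xi)), key=lambda i: abs(xi[i]))
    x = [0.0] * len(xi)
    x[j] = 1.0 if xi[j] >= 0 else -1.0
    return x


def conditional_gradient(grad, x1, lo_oracle, steps):
    """Recurrence (b): x_{t+1} = x_t + gamma_t * (x_X(grad(x_t)) - x_t)."""
    x = list(x1)
    for t in range(1, steps + 1):
        v = lo_oracle(grad(x))      # only a linear maximization over X is needed
        gamma = 2.0 / (t + 1)       # assumed classical stepsize, not from the text
        x = [xi + gamma * (vi - xi) for xi, vi in zip(x, v)]
    return x


# Hypothetical test objective f_*(x) = -0.5 * ||x - c||^2, with c inside the ball.
c = [0.3, -0.2, 0.1]


def f_star(x):
    return -0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def grad_f_star(x):
    return [ci - xi for xi, ci in zip(x, c)]


x_cg = conditional_gradient(grad_f_star, [1.0, 0.0, 0.0], lo_oracle_l1_ball, 2000)
print(x_cg, f_star(x_cg))
```

Note that no projection or prox-mapping on $X$ is ever computed: every iterate is a convex combination of points returned by the LO oracle, so feasibility is automatic.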



Contents of this paper. Assuming an LO oracle for $X$ is available, the major limitation in 
solving (P) by the Conditional Gradient method is the requirement for the problem objective $f_*$ to 
be smooth (otherwise, there are no particular requirements on the problem geometry). What 
to do if this requirement is not satisfied? In this paper, we investigate two simple options 
for processing this case, based on the Fenchel-type representation (2) of $f_*$. Primarily, we focus 
on the "nonsmooth" approach: we assume that such a representation is available and involves a 
Lipschitz continuous convex function $\psi$ given by a First Order oracle (i.e., a black box which, 
given on input $y \in Y$, returns the value $\psi(y)$ and a subgradient $\psi'(y)$ of $\psi$ at $y$). Besides this, 
we assume that $Y$ (but not $X$!) does admit a proximal setup $(\|\cdot\|_y, \omega_y(\cdot))$. In this case, we can 
pass from the problem of interest (P) to its dual 



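$$\mathrm{Opt}(D) = \min_{y \in Y} \left[ f(y) := \max_{x \in X} F(x,y) \right], \quad F(x,y) = \langle x, Ay + a \rangle + \psi(y). \tag{D}$$

Clearly, the LO oracle for $X$ along with the FO oracle for $\psi$ provide a FO oracle for (D): 

$$f(y) = \langle x(y), Ay + a \rangle + \psi(y), \quad f'(y) = A^* x(y) + \psi'(y), \quad x(y) := x_X(Ay + a).$$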

Since $Y$ admits a proximal setup, this is enough to get an $\epsilon$-solution to (D) in $N(\epsilon) = 
O(1)\frac{L^2\Omega_Y^2}{\epsilon^2}$ steps, $L$ being the Lipschitz constant of $f$ w.r.t. $\|\cdot\|_y$. However slow the resulting rate 
of convergence may look, we shall see in due course that there are important applications 
where this rate seems to be the best known so far. When implementing the outlined scheme, the 
only nontrivial question is how to recover a good approximate solution to the problem (P) of actual 
interest from a good approximate solution to its dual problem (D). The proposed answer to this 
question stems from the (pretty simple at first glance) machinery of accuracy 
certificates proposed recently in [10], and closely related to the work [...]. The summary of our 
approach is as follows. When solving (D) by a FO method, we generate search points $y_\tau \in Y$ 
where the subgradients $f'(y_\tau)$ of $f$ are computed; as a byproduct of the latter computation, 
we have at our disposal the points $x_\tau = x(y_\tau)$. As a result, after $t$ steps we have at our 
disposal the execution protocol $y^t = \{y_\tau, f'(y_\tau)\}_{\tau=1}^t$. An accuracy certificate associated with this 
protocol is, by definition, a collection $\lambda^t = \{\lambda_\tau^t\}_{\tau=1}^t$ of nonnegative weights $\lambda_\tau^t$ summing up to 
1: $\sum_{\tau=1}^t \lambda_\tau^t = 1$. The resolution of the certificate is, by definition, the quantity 

$$\epsilon(y^t, \lambda^t) = \max_{y \in Y} \sum_{\tau=1}^t \lambda_\tau^t \langle f'(y_\tau), y_\tau - y \rangle.$$

An immediate observation is (see section 2) that setting $\hat{y}^t = \sum_{\tau=1}^t \lambda_\tau^t y_\tau$, $\hat{x}^t = \sum_{\tau=1}^t \lambda_\tau^t x_\tau$, we 
get a pair of feasible solutions to (D) and to (P) such that 

$$[f(\hat{y}^t) - \mathrm{Opt}(D)] + [\mathrm{Opt}(P) - f_*(\hat{x}^t)] \leq \epsilon(y^t, \lambda^t).$$

Thus, assuming that the FO method in question produces, in addition to search points, accuracy 
certificates for the resulting execution protocols and that the resolution of these certificates goes 
to 0 as $t \to \infty$ at some rate, we can use the certificates to build feasible approximate solutions 
to (D) and to (P) with nonoptimalities, in terms of the objectives of the respective problems, 
going to 0 at the same rate. 

The scope of the outlined approach depends on whether we are able to equip known methods 
of nonsmooth convex minimization with computationally cheap mechanisms for building "good" 
accuracy certificates. The meaning of "good" in this context is exactly that the rate of convergence 
of the corresponding resolution to 0 is identical to the standard efficiency estimates of the 
methods (e.g., for MD this would mean that $\epsilon(y^t, \lambda^t) \leq O(1) L \Omega_Y t^{-1/2}$). [10] provides a positive 
answer to this question for the academically most attractive polynomial time oracle-oriented 
algorithms for convex optimization, like the Ellipsoid method. These methods, however, usually 
are poorly suited for large-scale applications. In this paper, we provide a positive answer to the 
above question for the three most attractive oracle-oriented FO methods for nonsmooth convex 
optimization known to us. Specifically, we consider 

• MD (where accuracy certificates are easy to obtain, see also [12]), 

• Full Memory Mirror Descent Level (MDL) method (a Mirror Descent extension of the 
Bundle-Level method [9]; to the best of our knowledge, this extension has not yet been described in 
the literature), and 

• Non-Euclidean Restricted Memory Level method (NERML) originating from [2], which 
we believe is the most attractive tool for large-scale nonsmooth oracle-based convex optimization. 
To the best of our knowledge, equipping NERML with accuracy certificates is a novel 
development. 

We also consider a different approach to nonsmooth convex optimization over a domain 
given by an LO oracle, an approach mimicking Nesterov's smoothing [11]. Specifically, assuming, as 
above, that $f_*$ is given by the Fenchel-type representation (2) with $Y$ admitting a proximal setup, we 
use this setup, exactly in the same way as in [11], to approximate $f_*$ by a smooth function which 
is then optimized by the CG algorithm. Therefore, the only difference with [11] is in replacing 
Nesterov's optimal algorithm for smooth convex optimization (which requires a good proximal 
setup for $X$) with the slower but less demanding (just an LO oracle for $X$ is enough) 
CG method. We shall see in due course that, unsurprisingly, the theoretical complexities of 
the two outlined approaches - the "nonsmooth" and the "smoothing" one - are essentially the same. 

The main body of the paper is organized as follows. In section 2 we develop the components 
of the approach related to duality and show how an accuracy certificate with small resolution 
yields a pair of good approximate solutions to (P) and (D). In section 3 we show how to equip 
the MD, MDL and NERML algorithms with accuracy certificates. In section 4 we investigate 
the outlined smoothing approach. In section 5 we consider examples, primarily of Machine 
Learning origin, where we advocate the use of the proposed algorithms. Some technical proofs 
are relegated to the appendix. 
2 Duality and accuracy certificates 



2.1 Situation 

Let $E_x$ be a Euclidean space, and let $X \subset E_x$ be a nonempty closed and bounded convex set equipped 
with an LO oracle - a procedure which, given on input $\xi \in E_x$, returns a maximizer $x_X(\xi)$ of 
the linear form $\langle \xi, x \rangle$ over $x \in X$. Let $f_*(x)$ be a concave function given by a Fenchel-type 
representation: 

$$f_*(x) = \min_{y \in Y} \left[ \langle x, Ay + a \rangle + \psi(y) \right], \tag{6}$$

where $Y$ is a convex compact subset of a Euclidean space $E_y$ and $\psi$ is a Lipschitz continuous 
convex function on $Y$ given by a First Order oracle. 
In the sequel we set 

$$f(y) = \max_{x \in X} \left[ \langle x, Ay + a \rangle + \psi(y) \right],$$

and consider two optimization problems 

$$\mathrm{Opt}(P) = \max_{x \in X} f_*(x) \tag{P}$$

$$\mathrm{Opt}(D) = \min_{y \in Y} f(y) \tag{D}$$

By the standard saddle point argument, we have $\mathrm{Opt}(P) = \mathrm{Opt}(D)$. 

2.2 Main observation 

Observe that the First Order oracle for $\psi$ along with the LO oracle for $X$ provide a First Order 
oracle for (D); specifically, the vector field 

$$f'(y) = A^* x_X(Ay + a) + \psi'(y) : Y \to E_y,$$

where $\psi'(y) \in \partial\psi(y)$, is a subgradient field of $f$. 

Consider a collection $y^t = \{y_\tau \in Y, f'(y_\tau)\}_{\tau=1}^t$ along with a collection $\lambda^t = \{\lambda_\tau \geq 0\}_{\tau=1}^t$ such 
that $\sum_{\tau=1}^t \lambda_\tau = 1$, and let us set 

$$y(y^t, \lambda^t) = \sum_{\tau=1}^t \lambda_\tau y_\tau,$$

$$x(y^t, \lambda^t) = \sum_{\tau=1}^t \lambda_\tau x_X(Ay_\tau + a),$$

$$\epsilon(y^t, \lambda^t) = \max_{y \in Y} \sum_{\tau=1}^t \lambda_\tau \langle f'(y_\tau), y_\tau - y \rangle.$$

In the sequel, the components $y_\tau$ of $y^t$ will be the search points generated by a First Order 
minimization method as applied to (D) at the steps $1, ..., t$. We call $y^t$ the associated execution 
protocol, call a collection $\lambda^t$ of $t$ nonnegative weights summing up to 1 an accuracy certificate 
for this protocol, and refer to the quantity $\epsilon(y^t, \lambda^t)$ as the resolution of the certificate $\lambda^t$ at 
the protocol $y^t$. 
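
The definitions above translate directly into code once maximizing a linear form over $Y$ is cheap. The following Python sketch is only an illustration under assumptions that are not in the text: $Y$ is taken to be the standard simplex (so the maximum of a linear form is attained at a vertex), and the helper names `aggregate` and `resolution_on_simplex` are hypothetical. The toy protocol corresponds to $f(y) = |y_1 - y_2|$ evaluated at the two vertices of the simplex in $\mathbf{R}^2$.

```python
def aggregate(vectors, weights):
    """Convex combination sum_tau lam_tau * v_tau of a list of vectors."""
    n = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(n)]


def resolution_on_simplex(points, subgrads, weights):
    """eps(y^t, lam^t) = max over the simplex of sum_tau lam_tau <g_tau, y_tau - y>."""
    const = sum(w * sum(gi * yi for gi, yi in zip(g, y))
                for w, g, y in zip(weights, subgrads, points))
    g_bar = aggregate(subgrads, weights)
    # max_{y in simplex} -<g_bar, y> is attained at a vertex and equals -min_i g_bar[i]
    return const - min(g_bar)


# Toy execution protocol for f(y) = |y_1 - y_2| (assumed example).
points = [[1.0, 0.0], [0.0, 1.0]]          # search points y_1, y_2
subgrads = [[1.0, -1.0], [-1.0, 1.0]]      # subgradients f'(y_1), f'(y_2)
lam = [0.5, 0.5]                           # accuracy certificate

y_hat = aggregate(points, lam)
eps = resolution_on_simplex(points, subgrads, lam)
print(y_hat, eps)
```

For other domains $Y$ only the last line of `resolution_on_simplex` changes: it has to be replaced by the maximum of the linear form $-\langle \bar{g}, y \rangle$ over the domain at hand.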

Our main observation is as follows: 

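Proposition 2.1 Let $y^t$, $\lambda^t$ be as above. Then $\hat{x} := x(y^t, \lambda^t)$, $\hat{y} := y(y^t, \lambda^t)$ are feasible solutions 
to problems (P), (D), respectively, and 

$$f(\hat{y}) - f_*(\hat{x}) = [f(\hat{y}) - \mathrm{Opt}(D)] + [\mathrm{Opt}(P) - f_*(\hat{x})] \leq \epsilon(y^t, \lambda^t). \tag{7}$$

Proof. Let $F(x,y) = \langle x, Ay + a \rangle + \psi(y)$ and $x(y) = x_X(Ay + a)$, so that $f(y) = F(x(y), y)$. 
Observe that $f'(y) = F_y'(x(y), y)$, where $F_y'(x,y)$ is a selection of the subdifferential of $F$ w.r.t. 
$y$, that is, $F_y'(x,y) \in \partial_y F(x,y)$ for all $x \in X$, $y \in Y$. Setting $x_\tau = x(y_\tau)$, we have for all $y \in Y$: 

$$\begin{aligned}
\epsilon(y^t, \lambda^t) &\geq \sum_{\tau=1}^t \lambda_\tau \langle f'(y_\tau), y_\tau - y \rangle = \sum_{\tau=1}^t \lambda_\tau \langle F_y'(x_\tau, y_\tau), y_\tau - y \rangle \\
&\geq \sum_{\tau=1}^t \lambda_\tau [F(x_\tau, y_\tau) - F(x_\tau, y)] \quad \text{[by convexity of $F$ in $y$]} \\
&= \sum_{\tau=1}^t \lambda_\tau [f(y_\tau) - F(x_\tau, y)] \quad \text{[since $x_\tau = x(y_\tau)$, so that $F(x_\tau, y_\tau) = f(y_\tau)$]} \\
&\geq f(\hat{y}) - F(\hat{x}, y) \quad \text{[by convexity of $f$ and concavity of $F(x,y)$ in $x$].}
\end{aligned} \tag{8}$$

We conclude that 

$$\epsilon(y^t, \lambda^t) \geq \max_{y \in Y} [f(\hat{y}) - F(\hat{x}, y)] = f(\hat{y}) - f_*(\hat{x}).$$

The inclusions $\hat{x} \in X$, $\hat{y} \in Y$ are evident. □ 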



Remark 2.1 In the proof of Proposition 2.1 the linearity of $F$ w.r.t. $x$ was never used, so that in fact 
we have proved a more general statement: 

Given a concave in $x \in X$ and convex in $y \in Y$ Lipschitz continuous function $F(x,y)$, let us associate with 
it a convex function $f(y) = \max_{x \in X} F(x,y)$, a concave function $f_*(x) = \min_{y \in Y} F(x,y)$ and problems (P) 
and (D). Let $F_y'(x,y)$ be a vector field with $F_y'(x,y) \in \partial_y F(x,y)$, so that with $x(y) \in \mathrm{Argmax}_{x \in X} F(x,y)$, 
the vector $f'(y) = F_y'(x(y), y)$ is a subgradient of $f$ at $y$. Assume that problem (D) associated with $F$ 
is solved by a FO method using $f'(y) = F_y'(x(y), y)$ which produced execution protocol $y^t$ and accuracy 
certificate $\lambda^t$. Then setting 

$$\hat{x} = \sum_\tau \lambda_\tau x(y_\tau) \quad \text{and} \quad \hat{y} = \sum_\tau \lambda_\tau y_\tau,$$

we ensure (7). 

Moreover, let $\delta > 0$, and let $x_\delta(y)$ be a $\delta$-maximizer of $F(x,y)$ in $x \in X$: for all $y \in Y$, 

$$F(x_\delta(y), y) \geq \max_{x \in X} F(x,y) - \delta.$$

Suppose that (D) is solved by a FO method using approximate subgradients $\tilde{f}'(y) = F_y'(x_\delta(y), y)$, and producing 
execution protocol $y^t = \{y_\tau, \tilde{f}'(y_\tau)\}_{\tau=1}^t$ and accuracy certificate $\lambda^t$. Then setting $\hat{x} = \sum_\tau \lambda_\tau x_\delta(y_\tau)$ 
and $\hat{y} = \sum_\tau \lambda_\tau y_\tau$, we ensure the $\delta$-relaxed version of (7) - the relation 

$$f(\hat{y}) - f_*(\hat{x}) \leq \epsilon(y^t, \lambda^t) + \delta, \quad \epsilon(y^t, \lambda^t) = \max_{y \in Y} \sum_{\tau=1}^t \lambda_\tau \langle \tilde{f}'(y_\tau), y_\tau - y \rangle.$$

All we need to extract the "Moreover" part of this statement from the proof of Proposition 2.1 is to set 
$x_\tau = x_\delta(y_\tau)$, to replace $f'(y_\tau)$ with $\tilde{f}'(y_\tau)$ and to replace the equality in (8) with the inequality 

$$\sum_{\tau=1}^t \lambda_\tau [F(x_\tau, y_\tau) - F(x_\tau, y)] \geq \sum_{\tau=1}^t \lambda_\tau [f(y_\tau) - \delta - F(x_\tau, y)].$$

T=l T=l 

Discussion. Proposition 2.1 says that whenever we can equip the subsequent execution protocols 
generated by a FO method, as applied to the dual problem (D), with accuracy certificates, 
we can generate solutions to the primal problem (P) of inaccuracy going to 0 at the same rate 
as the certificate resolution. In the sequel, we shall point out some "good" accuracy certificates 
for several of the most attractive FO algorithms for nonsmooth convex minimization. 
3 Accuracy certificates in oracle-oriented methods for large-scale nonsmooth convex optimization 

3.1 Convex minimization with certificates, I: Mirror Descent 
3.1.1 Proximal setup 

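As mentioned in the introduction, the Mirror Descent (MD) algorithm solving (D) is given 
by a norm $\|\cdot\|$ on $E_y$ and a distance-generating function (d.-g.f.) $\omega(y) : Y \to \mathbf{R}$ which should 
be continuous and convex on $Y$, should admit a continuous in $y \in Y^o = \{y \in Y : \partial\omega(y) \neq \emptyset\}$ 
selection of subgradients $\omega'(y)$, and should be strongly convex, modulus 1, w.r.t. $\|\cdot\|$, that 
is, 

$$\forall y, y' \in Y^o: \langle \omega'(y) - \omega'(y'), y - y' \rangle \geq \|y - y'\|^2.$$

A proximal setup $(\|\cdot\|, \omega(\cdot))$ for $Y$, $E_y$ gives rise to several entities, namely, 

• Bregman distance $V_y(z) = \omega(z) - \omega(y) - \langle \omega'(y), z - y \rangle$ ($y \in Y^o$, $z \in Y$). Due to strong 
convexity of $\omega$, we have 

$$\forall (z \in Y, y \in Y^o): V_y(z) \geq \frac{1}{2}\|z - y\|^2; \tag{9}$$

• $\omega$-center $y_\omega = \operatorname{argmin}_{y \in Y} \omega(y)$ of $Y$ and $\omega$-diameter 

$$\Omega = \Omega[Y, \omega(\cdot)] := \sqrt{2\left[\max_{y \in Y} \omega(y) - \min_{y \in Y} \omega(y)\right]}.$$

Observe that 

$$\langle \omega'(y_\omega), y - y_\omega \rangle \geq 0 \quad \forall y \in Y, \tag{10}$$

(see Lemma A.1), so that 

$$V_{y_\omega}(z) \leq \omega(z) - \omega(y_\omega) \leq \frac{1}{2}\Omega^2 \quad \forall z \in Y, \tag{11}$$

which combines with the inequality $V_y(z) \geq \frac{1}{2}\|z - y\|^2$ to yield the relation 

$$\forall y \in Y: \|y - y_\omega\| \leq \Omega; \tag{12}$$

• prox-mapping 

$$\mathrm{Prox}_y(\xi) = \operatorname{argmin}_{z \in Y} \left[ \langle \xi, z \rangle + V_y(z) \right],$$

where $\xi \in E_y$ and $y \in Y^o$. This mapping takes its values in $Y^o$ and satisfies the relation 

$$\forall (y \in Y^o, \xi \in E_y, y_+ = \mathrm{Prox}_y(\xi)): \langle \xi, y_+ - z \rangle \leq V_y(z) - V_{y_+}(z) - V_y(y_+) \quad \forall z \in Y, \tag{13}$$

see Lemma A.1. 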
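
As an illustration of these entities, the Python sketch below instantiates one well-known proximal setup, which is an assumption chosen here and not taken from the text: $Y$ is the standard simplex, $\|\cdot\| = \|\cdot\|_1$ and $\omega(y) = \sum_i y_i \ln y_i$ (the entropy). For this setup the Bregman distance is the Kullback-Leibler divergence and the prox-mapping has the closed form $[\mathrm{Prox}_y(\xi)]_i \propto y_i e^{-\xi_i}$. The function names are hypothetical.

```python
import math


def entropy_bregman(y, z):
    """V_y(z) for the entropy d.-g.f. on the simplex (y must have positive entries)."""
    return sum(zi * math.log(zi / yi) for zi, yi in zip(z, y) if zi > 0.0)


def prox_entropy(y, xi):
    """Prox_y(xi) = argmin over the simplex of <xi, z> + V_y(z), in closed form."""
    shift = min(xi)  # subtracting a constant from xi does not change the minimizer
    w = [yi * math.exp(-(x - shift)) for yi, x in zip(y, xi)]
    s = sum(w)
    return [wi / s for wi in w]


y0 = [0.2, 0.5, 0.3]        # a point with positive entries, i.e. in Y^o
xi = [1.0, -0.5, 0.25]      # an arbitrary linear form
y_plus = prox_entropy(y0, xi)

# Check relation (13) at a sample point z of the simplex.
z = [0.1, 0.1, 0.8]
lhs = sum(a * (p - q) for a, p, q in zip(xi, y_plus, z))
rhs = entropy_bregman(y0, z) - entropy_bregman(y_plus, z) - entropy_bregman(y0, y_plus)
print(y_plus, lhs, rhs)
```

For this particular setup the $\omega$-center is the uniform distribution and $\Omega = \sqrt{2 \ln n}$ on the simplex in $\mathbf{R}^n$, which is what makes the entropy setup attractive in high dimensions.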
3.1.2 Mirror Descent algorithm 

MD algorithm works with a vector field 

$$y \mapsto g(y) : Y \to E_y, \tag{14}$$

which is oracle represented, meaning that we have access to an oracle which, given on input 
$y \in Y$, returns $g(y)$. From now on we assume that this field is bounded: 

$$\|g(y)\|_* \leq L[g] < \infty, \quad \forall y \in Y, \tag{15}$$

where $\|\cdot\|_*$ is the norm conjugate to $\|\cdot\|$. The algorithm is the recurrence 

$$y_1 = y_\omega; \quad y_\tau \mapsto g_\tau := g(y_\tau) \mapsto y_{\tau+1} := \mathrm{Prox}_{y_\tau}(\gamma_\tau g_\tau), \tag{MD}$$

where $\gamma_\tau > 0$ are stepsizes. Let us equip this recurrence with accuracy certificates, setting 

$$\lambda^t = \left(\sum_{\tau=1}^t \gamma_\tau\right)^{-1} [\gamma_1; ...; \gamma_t]. \tag{16}$$

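Proposition 3.1 For every $t$, the resolution 

$$\epsilon(y^t, \lambda^t) := \max_{y \in Y} \sum_{\tau=1}^t \lambda_\tau \langle g(y_\tau), y_\tau - y \rangle$$

of $\lambda^t$ on the execution protocol $y^t = \{y_\tau, g(y_\tau)\}_{\tau=1}^t$ satisfies the standard MD efficiency estimate 

$$\epsilon(y^t, \lambda^t) \leq \frac{\Omega^2 + \sum_{\tau=1}^t \gamma_\tau^2 \|g(y_\tau)\|_*^2}{2\sum_{\tau=1}^t \gamma_\tau}. \tag{17}$$

In particular, with $\gamma_\tau = \frac{\gamma(t)}{\|g(y_\tau)\|_*}$, $\gamma(t) := \frac{\Omega}{\sqrt{t}}$ for $1 \leq \tau \leq t$, 

$$\epsilon(y^t, \lambda^t) \leq \frac{\Omega L[g]}{\sqrt{t}}. \tag{18}$$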

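Proof is given by the standard MD rate-of-convergence derivation: 

$$\begin{aligned}
&\forall z \in Y: \langle \gamma_\tau g_\tau, y_{\tau+1} - z \rangle \leq V_{y_\tau}(z) - V_{y_{\tau+1}}(z) - V_{y_\tau}(y_{\tau+1}) \quad \text{[see (13)]} \\
\Rightarrow\ &\forall z \in Y: \langle \gamma_\tau g_\tau, y_\tau - z \rangle \leq V_{y_\tau}(z) - V_{y_{\tau+1}}(z) + \underbrace{\left[ \langle \gamma_\tau g_\tau, y_\tau - y_{\tau+1} \rangle - V_{y_\tau}(y_{\tau+1}) \right]}_{\leq \gamma_\tau \|g_\tau\|_* \|y_\tau - y_{\tau+1}\| - \frac{1}{2}\|y_\tau - y_{\tau+1}\|^2} \\
\Rightarrow\ &\forall z \in Y: \langle \gamma_\tau g_\tau, y_\tau - z \rangle \leq V_{y_\tau}(z) - V_{y_{\tau+1}}(z) + \tfrac{1}{2}\gamma_\tau^2 \|g_\tau\|_*^2 \\
\Rightarrow\ &\forall z \in Y: \sum_{\tau=1}^t \gamma_\tau \langle g_\tau, y_\tau - z \rangle \leq \underbrace{V_{y_1}(z)}_{\leq \frac{1}{2}\Omega^2} - \underbrace{V_{y_{t+1}}(z)}_{\geq 0} + \tfrac{1}{2}\sum_{\tau=1}^t \gamma_\tau^2 \|g_\tau\|_*^2 \\
\Rightarrow\ &\forall z \in Y: \sum_{\tau=1}^t \lambda_\tau \langle g_\tau, y_\tau - z \rangle \leq \frac{\Omega^2 + \sum_{\tau=1}^t \gamma_\tau^2 \|g_\tau\|_*^2}{2\sum_{\tau=1}^t \gamma_\tau},
\end{aligned}$$

where the concluding $\Rightarrow$ is given by (16). □ 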

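^We assume here that $g_\tau \neq 0$ for all $\tau \leq t$. In the opposite case, the situation is trivial: when $g(y_{\tau_*}) = 0$ for 
some $\tau_* \leq t$, setting $\lambda_\tau^t = 0$ for $\tau \neq \tau_*$ and $\lambda_{\tau_*}^t = 1$, we ensure that $\epsilon(y^t, \lambda^t) = 0$. 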
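
The recurrence (MD) together with the certificate (16) can be sketched in a few lines of Python. The instance below is an assumed toy example, not one from the text: $Y$ is the standard simplex with the entropy setup ($\|\cdot\| = \|\cdot\|_1$, $\|\cdot\|_* = \|\cdot\|_\infty$, $\Omega = \sqrt{2 \ln n}$), and $g$ is a subgradient field of the hypothetical function $f(y) = \max_k \langle a_k, y \rangle$. The stepsizes follow the rule $\gamma_\tau = \Omega / (\sqrt{t}\,\|g_\tau\|_*)$.

```python
import math


def prox_entropy(y, xi):
    """Entropy prox-mapping on the simplex: component i is proportional to y_i * exp(-xi_i)."""
    shift = min(xi)
    w = [yi * math.exp(-(x - shift)) for yi, x in zip(y, xi)]
    s = sum(w)
    return [wi / s for wi in w]


def mirror_descent_with_certificate(g, n, t):
    """Run t steps of (MD) from the omega-center and return protocol and certificate (16)."""
    omega_diam = math.sqrt(2.0 * math.log(n))
    y = [1.0 / n] * n                       # omega-center of the simplex
    points, grads, gammas = [], [], []
    for _ in range(t):
        g_tau = g(y)
        gamma = omega_diam / (math.sqrt(t) * max(abs(v) for v in g_tau))
        points.append(y)
        grads.append(g_tau)
        gammas.append(gamma)
        y = prox_entropy(y, [gamma * v for v in g_tau])
    total = sum(gammas)
    lam = [gm / total for gm in gammas]     # lambda^t = gammas normalized to sum 1
    return points, grads, lam, omega_diam


def resolution_on_simplex(points, grads, lam):
    """max over the simplex of sum_tau lam_tau <g_tau, y_tau - y>."""
    n = len(points[0])
    const = sum(l * sum(a * b for a, b in zip(gr, y)) for l, gr, y in zip(lam, grads, points))
    g_bar = [sum(l * gr[i] for l, gr in zip(lam, grads)) for i in range(n)]
    return const - min(g_bar)


ROWS = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0]]   # assumed data a_k


def f(y):
    return max(sum(a * v for a, v in zip(row, y)) for row in ROWS)


def g(y):
    # a subgradient of f at y: the row attaining the maximum
    return max(ROWS, key=lambda row: sum(a * v for a, v in zip(row, y)))


T = 400
points, grads, lam, omega_diam = mirror_descent_with_certificate(g, 3, T)
eps = resolution_on_simplex(points, grads, lam)
y_hat = [sum(l * y[i] for l, y in zip(lam, points)) for i in range(3)]
L_obs = max(max(abs(v) for v in gr) for gr in grads)
print(eps, omega_diam * L_obs / math.sqrt(T), f(y_hat))
```

The printed resolution is an online-computable quantity: it certifies that $f(\hat{y}) - \min_Y f \leq \epsilon(y^t, \lambda^t)$ without knowledge of the optimal value.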
Solving (P) and (D) using MD. In order to solve (D), we apply MD to the vector field 
$g = f'$. Assuming that 

$$L_f = \sup_{y \in Y} \{\|f'(y)\|_* = \|A^* x(y) + \psi'(y)\|_*\} < \infty \quad [x(y) = x_X(Ay + a)], \tag{19}$$

we can set $L[g] = L_f$. For this setup, Proposition 2.1 implies that the MD accuracy certificate 
$\lambda^t$, as defined in (16), taken together with the MD execution protocol $y^t = \{y_\tau, g(y_\tau) = f'(y_\tau) := 
A^* x(y_\tau) + \psi'(y_\tau)\}_{\tau=1}^t$, yields the primal-dual feasible approximate solutions 

$$\hat{x}^t = \sum_{\tau=1}^t \lambda_\tau^t x_\tau, \quad \hat{y}^t = \sum_{\tau=1}^t \lambda_\tau^t y_\tau \quad [x_\tau = x(y_\tau)] \tag{20}$$

to (P) and to (D) such that 

$$f(\hat{y}^t) - f_*(\hat{x}^t) \leq \epsilon(y^t, \lambda^t).$$

Combining this conclusion with Proposition 3.1 we arrive at the following result: 

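Corollary 3.1 In the case of (19), for every $t = 1, 2, ...$ the $t$-step MD with the stepsize policy 

$$\gamma_\tau = \frac{\Omega}{\sqrt{t}\,\|f'(y_\tau)\|_*}, \quad 1 \leq \tau \leq t,$$

as applied to (D) yields feasible approximate solutions $\hat{x}^t$, $\hat{y}^t$ to (P), (D) such that 

$$[f(\hat{y}^t) - \mathrm{Opt}(P)] + [\mathrm{Opt}(D) - f_*(\hat{x}^t)] \leq \frac{\Omega L_f}{\sqrt{t}}. \tag{21}$$

In particular, given $\epsilon > 0$, it takes at most 

$$t(\epsilon) = \mathrm{Ceil}\left( \frac{\Omega^2 L_f^2}{\epsilon^2} \right) \tag{22}$$

steps of the algorithm to ensure that 

$$[f(\hat{y}^t) - \mathrm{Opt}(P)] + [\mathrm{Opt}(D) - f_*(\hat{x}^t)] \leq \epsilon. \tag{23}$$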


3.2 Convex minimization with certificates, II: Mirror Descent with full memory 

Algorithm MDL - Mirror Descent Level method - is a non-Euclidean version of (a variant of) the 
Bundle-Level method [9]; to the best of our knowledge, this extension has not been presented in the 
literature. Another novelty in what follows is equipping the method with accuracy certificates. 

MDL is a version of MD with "full memory", meaning that the first order information on 
the objective being minimized is preserved and utilized at subsequent steps, rather than being 
"summarized" in the current iterate, as is the case for MD. While the guaranteed accuracy 
bounds for MDL are similar to those for MD, the typical practical behavior of the algorithm is 
better than that of the "memoryless" MD. 

^We assume that $f'(y_\tau) \neq 0$ for $\tau \leq t$; otherwise, as we remember, the situation is trivial. 
3.2.1 Preliminaries 

MDL with certificates, which we are about to describe, is aimed at processing an oracle-represented 
vector field (14) satisfying (15), with the same assumptions on $Y$ and the same proximal setup 
as in the case of MD. 

We associate with $y \in Y$ the affine function 

$$h_y(z) = \langle g(y), y - z \rangle \quad [\geq f(y) - f(z) \text{ when } g = f' \text{ is a subgradient field of a convex function } f],$$

and with a finite set $S \subset Y$ the family $\mathcal{F}_S$ of affine functions on $E_y$ which are convex combinations 
of the functions $h_y(z)$, $y \in S$. In the sequel, the words "we have at our disposal a function 
$h(\cdot) \in \mathcal{F}_S$" mean that we know the functions $h_y(\cdot)$, $y \in S$, and nonnegative weights $\lambda_y$, $y \in S$, 
summing up to 1, such that $h(z) = \sum_{y \in S} \lambda_y h_y(z)$. 

The goal of the algorithm is, given a tolerance $\epsilon > 0$, to find a finite set $S \subset Y$ and $h \in \mathcal{F}_S$ 
such that 

$$\max_{y \in Y} h(y) \leq \epsilon. \tag{24}$$

Note that our target $S$, $h$ are of the form $S = \{y_1, ..., y_t\}$, $h(y) = \sum_{\tau=1}^t \lambda_\tau \langle g(y_\tau), y_\tau - y \rangle$ with 
nonnegative $\lambda_\tau$ summing up to 1. In other words, our target is to build an execution protocol 
$y^t = \{y_\tau, g(y_\tau)\}_{\tau=1}^t$ and an associated accuracy certificate $\lambda^t$ such that $\epsilon(y^t, \lambda^t) \leq \epsilon$. 



3.2.2 Construction 

As applied to (14), MDL at a step $t = 1, 2, ...$ generates search point $y_t \in Y$ where the value 
$g(y_t)$ of $g$ is computed; it provides us with the affine function $h_t(z) = \langle g(y_t), y_t - z \rangle$. Steps of 
the method are split into subsequent phases numbered $s = 1, 2, ...$, and every phase is associated 
with optimality gap $\Delta_s \geq 0$. 

To initialize the method, we set $y_1 = y_\omega$, $S_0 = \emptyset$, $\Delta_0 = +\infty$. 

At a step $t$ we act as follows: 

• given $y_t$, we compute $g(y_t)$, thus getting $h_t(\cdot)$, and set $S_t^+ = S_{t-1} \cup \{t\}$; 

• we solve the auxiliary problem 

$$\epsilon_t = \max_{y \in Y} \min_{\tau \in S_t^+} h_\tau(y) = \max_{y \in Y} \min \left\{ \sum_{\tau \in S_t^+} \lambda_\tau h_\tau(y) : \lambda_\tau \geq 0, \sum_{\tau \in S_t^+} \lambda_\tau = 1 \right\} \tag{25}$$

By the von Neumann lemma, an optimal solution to this (auxiliary) problem is associated 
with nonnegative and summing up to 1 weights $\lambda_\tau^t$, $\tau \in S_t^+$, such that 

$$\epsilon_t = \max_{y \in Y} \sum_{\tau \in S_t^+} \lambda_\tau^t h_\tau(y),$$

and we assume that as a result of solving (25), both $\epsilon_t$ and $\lambda_\tau^t$ become known. We set 
$\lambda_\tau^t = 0$ for all $\tau \leq t$ which are not in $S_t^+$, thus getting an accuracy certificate $\lambda^t = [\lambda_1^t; ...; \lambda_t^t]$ 
for the execution protocol $y^t = \{y_\tau, g(y_\tau)\}_{\tau=1}^t$ along with $h^t(\cdot) = \sum_{\tau=1}^t \lambda_\tau^t h_\tau(\cdot)$. Note that 
by construction 

$$\epsilon(y^t, \lambda^t) = \max_{y \in Y} h^t(y) = \epsilon_t.$$

If $\epsilon_t \leq \epsilon$, we terminate - $h(\cdot) = h^t(\cdot)$ satisfies (24). Otherwise we proceed as follows: 

• If (case A) $\epsilon_t \leq \gamma\Delta_{s-1}$, $\gamma \in (0, 1)$ being method's control parameter, we say that step $t$ 
starts phase $s$ (e.g., step $t = 1$ starts phase 1), set 

$$\Delta_s = \epsilon_t, \quad S_t = \{\tau : 1 \leq \tau \leq t : \lambda_\tau^t > 0\} \cup \{1\}, \quad \hat{y}_t = y_\omega;$$

otherwise (case B) we set 

$$S_t = S_t^+, \quad \hat{y}_t = y_t.$$

Note that in both cases we have 

$$\epsilon_t = \max_{y \in Y} \min_{\tau \in S_t} h_\tau(y) \tag{26}$$

• Finally, we define the $t$-th level as $\ell_t = \gamma\epsilon_t$ and associate with this quantity the level set 
$U_t = \{y \in Y : h_\tau(y) \geq \ell_t \ \forall \tau \in S_t\}$, specify $y_{t+1}$ as the $\omega$-projection of $\hat{y}_t$ on $U_t$: 

$$y_{t+1} = \operatorname{argmin}_{y \in U_t} \left[ \omega(y) - \langle \omega'(\hat{y}_t), y \rangle \right] \tag{27}$$

and loop to step $t + 1$. 
3.2.3 Efficiency estimate 

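Proposition 3.2 Given on input a target tolerance $\epsilon > 0$, the MDL algorithm terminates after 
finitely many steps, with the output $y^t = \{Y \ni y_\tau, g(y_\tau)\}_{\tau=1}^t$, $\lambda^t = \{\lambda_\tau^t \geq 0\}_{\tau=1}^t$, $\sum_\tau \lambda_\tau^t = 1$, such 
that 

$$\epsilon(y^t, \lambda^t) \leq \epsilon. \tag{28}$$

The number of steps of the algorithm does not exceed 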
^=-^^^^\-2+^ [^ = m^{-)]]- (29) 
7^(1 — 7^)e"^ 

For proof, see Section A.2. 

Remark 3.1 Assume that $\omega(\cdot)$ is continuously differentiable on the entire $Y$, so that the quantity 

= 0+[y,w] := max Vy{z) 

y,z(iY 

is finite. From the proof of Proposition 3.2 it follows immediately that one can substitute the 
rule "$\hat{y}_t = y_\omega$ when $t$ starts a phase and $\hat{y}_t = y_t$ otherwise" with a simpler one "$\hat{y}_t = y_t$ for all 
$t$," at the price of replacing $\Omega$ in (29) with $\Omega^+$. 

Solving (P) and (D) via MDL is completely similar to the case of MD: given a desired 
tolerance $\epsilon > 0$, one applies MDL to the vector field $g(y) = f'(y)$ until the target is 
satisfied. Assuming (19), we can set $L[g] = L_f$, so that by Proposition 3.2 our target will be 
achieved in 

t{e) < Ceil [^ ^^^^^^y^ + 1) [^ = m^{-)]] (30) 

steps, with $L_f$ given by (19). Assuming that the target is attained at a step $t$, we have at our 
disposal the execution protocol $y^t = \{y_\tau, f'(y_\tau)\}_{\tau=1}^t$ along with the accuracy certificate $\lambda^t = \{\lambda_\tau^t\}$ 
such that $\epsilon(y^t, \lambda^t) \leq \epsilon$ (by the same Proposition 3.2). Therefore, specifying $\hat{x}^t$, $\hat{y}^t$ according to 
(20) and invoking Proposition 2.1 we ensure (23). Note that the complexity $t = t(\epsilon)$ of finding 
these solutions, as given by (30), is completely similar to the complexity bound (22) of MD. 
3.3 Convex minimization with certificates, III: restricted memory Mirror Descent 

The fact that the number of linear functions $h_\tau(\cdot)$ involved in the auxiliary problems (25), 
(27) (and thus the computational complexity of these problems) grows as the algorithm proceeds 
is a serious shortcoming of MDL from the computational viewpoint. The NERML (Non-Euclidean 
Restricted Memory Level Method) algorithm, which originates from [2], is a version of MD "with 
restricted memory". In this algorithm the number of pieces in the models never exceeds a given 
number $m$, a control parameter which can be set to any desired integer value. The original 
NERML algorithm, however, was not equipped with accuracy certificates, and our goal here is 
to correct this omission. 

3.3.1 Construction 

Same as MDL, NERML processes an oracle-represented vector field (14) satisfying the boundedness 
condition (15), with the ultimate goal to ensure (24). The setup for the algorithm is 
identical to that for MDL. 

The algorithm builds search sequence $y_1 \in Y, y_2 \in Y, ...$ along with the sets $S_\tau = \{y_1, ..., y_\tau\}$, 
according to the following rules: 

A. Initialization. We set $y_1 = y_\omega := \operatorname{argmin}_{y \in Y} \omega(y)$, compute $g(y_1)$ and set $f_1 = \max_{y \in Y} h_{y_1}(y)$. 
We clearly have $f_1 \geq 0$. 

• In the case of $f_1 = 0$, we terminate and output $h(\cdot) = h_{y_1}(\cdot) \in \mathcal{F}_{S_1}$, thus ensuring (24) 
with $\epsilon = 0$. 

• When $f_1 > 0$, we proceed. Our subsequent actions are split into phases indexed with 
$s = 1, 2, ....$ 

B. Phase $s = 1, 2, ...$ At the beginning of phase $s$, we have at our disposal 

• the set $S^s = \{y_1, ..., y_{t_s}\} \subset Y$ of already built search points, and 

• an affine function $h^s(\cdot) \in \mathcal{F}_{S^s}$ along with the real $f_s := \max_{y \in Y} h^s(y) \in (0, f_1]$. 

We define the level $\ell_s$ of phase $s$ as 

$$\ell_s = \gamma f_s,$$

where $\gamma \in (0, 1)$ is a control parameter of the method. Note that $\ell_s > 0$ due to $f_s > 0$. 

To save notation, we denote the search points generated at phase $s$ as $u_1, u_2, ...$, so that 

$$y_{t_s + \tau} = u_\tau, \quad \tau = 1, 2, ....$$

B.1. Initializing phase $s$. We somehow choose a collection of $m$ functions $h^s_{0,j}(\cdot) \in \mathcal{F}_{S^s}$, 
$1 \leq j \leq m$, such that the set 

$$Y^s_0 = \mathrm{cl}\{y \in Y : h^s_{0,j}(y) > \ell_s, 1 \leq j \leq m\}$$

is nonempty (here a positive integer $m$ is a control parameter of the method). We set 

$$u_1 = y_\omega.$$

^Note that to ensure the nonemptiness of $Y^s_0$ it suffices to set $h^s_{0,j}(\cdot) = h^s(\cdot)$, so that $h^s_{0,j}(y) > \ell_s$ for 
$y \in \mathrm{Argmax}_Y h^s(\cdot)$; recall that $f_s = \max_{y \in Y} h^s(y) > 0$. 

B.2. Step $\tau = 1, 2, ...$ of phase $s$: 

B.2.1. At the beginning of step $\tau$, we have at our disposal 

1. the set $S^s_{\tau-1}$ of all previous search points; 

2. a collection of functions $\{h^s_{\tau-1,j}(\cdot) \in \mathcal{F}_{S^s_{\tau-1}}\}_{j=1}^m$ such that the set 

$$Y^s_{\tau-1} = \mathrm{cl}\{x \in Y : h^s_{\tau-1,j}(x) > \ell_s, 1 \leq j \leq m\}$$

is nonempty, 

3. current search point $u_\tau \in Y^s_{\tau-1}$ such that 



argminu;(y) (11^) 



Note that this relation is trivially true when r = 1. 

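B.2.2. Our actions at step $\tau$ are as follows.
B.2.2.1. We compute $g(u_\tau)$ and set

$$h^s_{\tau-1,m+1}(y) = \langle g(u_\tau), u_\tau - y\rangle.$$

B.2.2.2. We solve the auxiliary problem

$$\mathrm{Opt} = \max_{y\in Y}\ \min_{1\le j\le m+1} h^s_{\tau-1,j}(y) \qquad (31)$$

Note that

$$\mathrm{Opt} = \max_{y\in Y}\ \min_{\lambda \ge 0,\ \sum_j \lambda_j = 1}\ \sum_{j=1}^{m+1} \lambda_j h^s_{\tau-1,j}(y) = \max_{y\in Y} \sum_{j=1}^{m+1} \lambda^*_j h^s_{\tau-1,j}(y),$$

where $\lambda^*_j \ge 0$ and $\sum_{j=1}^{m+1} \lambda^*_j = 1$. We assume that when solving the auxiliary problem, we
compute the above weights $\lambda^*_j$, and thus have at our disposal the function

$$h^{s,\tau}(\cdot) = \sum_{j=1}^{m+1} \lambda^*_j h^s_{\tau-1,j}(\cdot) \in \mathcal{F}_{S^s_\tau}$$

such that

$$\mathrm{Opt} = \max_{y\in Y} h^{s,\tau}(y).$$

B.2.2.3. Case A: If $\mathrm{Opt} \le \epsilon$ we terminate and output $h^{s,\tau}(\cdot) \in \mathcal{F}_{S^s_\tau}$; this function satisfies (24).

Case B: In case of $\mathrm{Opt} < \ell_s + \theta(f_s - \ell_s)$, where $\theta \in (0, 1)$ is the method's control parameter, we
terminate phase s and start phase s + 1 by setting $h^{s+1} = h^{s,\tau}$, $f_{s+1} = \mathrm{Opt}$. Note that by
construction $0 < f_{s+1} \le [\gamma + \theta(1 - \gamma)] f_s < f_1$, so that we have at our disposal all we need to
start phase s + 1.

Case C: When neither A nor B takes place, we proceed with phase s, specifically, as follows:
B.2.2.4. Note that there exists a point $u \in Y$ such that $h^s_{\tau-1,j}(u) \ge \mathrm{Opt} > \ell_s$, $1 \le j \le m+1$, so that the set
$Y_\tau = \{y \in Y : h^s_{\tau-1,j}(y) \ge \ell_s,\ 1 \le j \le m + 1\}$ intersects with the relative interior of Y. We
specify $u_{\tau+1}$ as

$$u_{\tau+1} = \mathrm{argmin}_{y\in Y_\tau}\, \omega(y). \qquad (32)$$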
Observe that

$$u_{\tau+1} \in Y^s_{\tau-1} \qquad (33)$$

due to $Y_\tau \subset Y^s_{\tau-1}$.

B.2.2.5. By optimality conditions for (32) (see Lemma A.1), for certain nonnegative $\mu_j$, $1 \le j \le m + 1$, such that

$$\mu_j[h^s_{\tau-1,j}(u_{\tau+1}) - \ell_s] = 0, \quad 1 \le j \le m + 1,$$

the vector

$$e := \omega'(u_{\tau+1}) - \sum_{j=1}^{m+1} \mu_j \nabla h^s_{\tau-1,j}(\cdot) \qquad (34)$$

is such that

$$\langle e, y - u_{\tau+1}\rangle \ge 0 \quad \forall y \in Y. \qquad (35)$$

• In the case of $\mu = \sum_j \mu_j > 0$, we set

$$h^s_{\tau,1} = \frac{1}{\mu} \sum_{j=1}^{m+1} \mu_j h^s_{\tau-1,j},$$

so that

$$(a)\ h^s_{\tau,1} \in \mathcal{F}_{S^s_\tau}, \quad (b)\ h^s_{\tau,1}(u_{\tau+1}) = \ell_s, \quad (c)\ \langle \omega'(u_{\tau+1}) - \mu \nabla h^s_{\tau,1}, y - u_{\tau+1}\rangle \ge 0\ \ \forall y \in Y. \qquad (36)$$

We then discard from the collection $\{h^s_{\tau-1,j}(\cdot)\}_{j=1}^{m+1}$ two (arbitrarily chosen) elements and add
to $h^s_{\tau,1}$ the remaining m − 1 elements of the collection, thus getting an m-element collection
$\{h^s_{\tau,j}\}_{j=1}^m$ of elements of $\mathcal{F}_{S^s_\tau}$.

Remark 3.2 We have ensured that the set $Y^s_\tau = \mathrm{cl}\{y \in Y : h^s_{\tau,j}(y) > \ell_s,\ 1 \le j \le m\}$ is
nonempty (indeed, we clearly have $h^s_{\tau,j}(u) > \ell_s$, $1 \le j \le m$, where u is an optimal solution
to (31)). Besides this, we also have $(\Pi^s_{\tau+1})$. Indeed, by construction $u_{\tau+1} \in Y_\tau$, meaning
that $h^s_{\tau-1,j}(u_{\tau+1}) \ge \ell_s$, $1 \le j \le m+1$. Since $h^s_{\tau,j}$ are convex combinations of the functions
$h^s_{\tau-1,j}$, $1 \le j \le m + 1$, it follows that $u_{\tau+1} \in Y^s_\tau$. Further, (36.b) and (36.c) imply that
$u_{\tau+1} = \mathrm{argmin}_y \{\omega(y) : y \in Y,\ h^s_{\tau,1}(y) \ge \ell_s\}$, and the right hand side set clearly contains
$Y^s_\tau$. We conclude that $u_{\tau+1}$ indeed is the minimizer of $\omega(\cdot)$ on $Y^s_\tau$.

• In the case of $\mu = 0$, (34) - (35) say that $u_{\tau+1}$ is a minimizer of $\omega(\cdot)$ on Y. In this case,
we discard from the collection $\{h^s_{\tau-1,j}\}_{j=1}^{m+1}$ one (arbitrarily chosen) element, thus getting the
m-element collection $\{h^s_{\tau,j}\}_{j=1}^m$. Here, for exactly the same reasons as above, the set $Y^s_\tau :=
\mathrm{cl}\{y \in Y : h^s_{\tau,j}(y) > \ell_s,\ 1 \le j \le m\}$ is nonempty and contains $u_{\tau+1}$, and, of course, $(\Pi^s_{\tau+1})$ holds true (since
$u_{\tau+1}$ minimizes $\omega(\cdot)$ on the entire Y).

In both cases (those of $\mu > 0$ and of $\mu = 0$), we have built the data required to start step
$\tau + 1$ of phase s, and we proceed to this step.
The description of the algorithm is completed.
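
The branching rule of B.2.2.3 can be summarized in a few lines of code. The sketch below is only an illustration under our own naming (the function `nerml_step_outcome` and its arguments are hypothetical, not part of the original algorithm description); it takes the optimal value Opt of (31) as given and does not solve the auxiliary problems themselves.

```python
def nerml_step_outcome(opt, eps, f_s, gamma, theta):
    """Classify a NERML step into Case A, B or C of B.2.2.3.

    opt   -- optimal value Opt of the auxiliary problem (31), assumed precomputed
    eps   -- target tolerance
    f_s   -- the real f_s attached to the current phase s (f_s > 0)
    gamma -- level parameter in (0, 1); the phase level is gamma * f_s
    theta -- phase-switch parameter in (0, 1)
    """
    if not (0.0 < gamma < 1.0 and 0.0 < theta < 1.0):
        raise ValueError("gamma and theta must lie in (0, 1)")
    level = gamma * f_s                      # level of phase s
    if opt <= eps:
        return "A"                           # terminate with the certificate
    if opt < level + theta * (f_s - level):
        return "B"                           # close phase s; next phase uses f_{s+1} = opt
    return "C"                               # stay in phase s and compute the next search point


# Usage: with f_s = 1, gamma = theta = 0.5 the level is 0.5 and the switch threshold is 0.75.
print(nerml_step_outcome(0.6, 0.1, 1.0, 0.5, 0.5))
```

Note that Case A is tested first, exactly as in the text, so a small Opt terminates the method even when it would also trigger a phase change.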

Remark 3.3 Same as MDL, the outlined algorithm requires solving at every step two nontrivial
auxiliary optimization problems - (31) and (32). It is explained in [?] that these problems are
relatively easy, provided that m is moderate (note that this parameter is under our full control)
and Y and $\omega$ are "simple and fit each other," meaning that we can easily solve problems of the
form

$$\min_{x\in Y}\ [\omega(x) + \langle a, x\rangle] \qquad (*)$$
(that is, our proximal setup for Y results in an easy-to-compute prox-mapping).






Remark 3.4 By construction, the presented algorithm produces upon termination (if any)

• an execution protocol $y^t = \{y_\tau, g(y_\tau)\}_{\tau=1}^t$, where t is the step where the algorithm
terminates, and $y_\tau$, $1 \le \tau \le t$, are the search points generated in course of the run; by
construction, all these search points belong to Y;

• an accuracy certificate $\lambda^t$ - a collection of nonnegative weights $\lambda_1, \ldots, \lambda_t$ summing up
to 1 - such that the affine function $h(y) = \sum_{\tau=1}^t \lambda_\tau \langle g(y_\tau), y_\tau - y\rangle$ satisfies the relation
$\epsilon(y^t, \lambda^t) := \max_{y\in Y} h(y) \le \epsilon$, where $\epsilon$ is the target tolerance, exactly as required in (24).
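
To make the quantity $\max_{y\in Y} h(y)$ from the remark concrete, here is a minimal sketch for the special case where Y is the unit box $[-1,1]^n$ (an assumption made only for this illustration; the function name and data layout are ours). In this case the maximum of the affine function h is available in closed form: $h(y) = c - \langle G, y\rangle$ with $G = \sum_\tau \lambda_\tau g(y_\tau)$ and $c = \sum_\tau \lambda_\tau\langle g(y_\tau), y_\tau\rangle$, so that the maximum over the box equals $c + \|G\|_1$.

```python
def certificate_gap_on_box(points, grads, weights):
    """Return max over y in [-1, 1]^n of sum_tau w_tau * <g_tau, y_tau - y>.

    points  -- search points y_tau (lists of floats)
    grads   -- vectors g(y_tau), same shapes as points
    weights -- certificate weights, nonnegative and summing to 1
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to 1")
    n = len(points[0])
    # constant term c = sum_tau w_tau * <g_tau, y_tau>
    const = sum(w * sum(gi * yi for gi, yi in zip(g, y))
                for w, g, y in zip(weights, grads, points))
    # aggregated vector G = sum_tau w_tau * g_tau
    agg = [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(n)]
    # max over the box of c - <G, y> is c + ||G||_1
    return const + sum(abs(a) for a in agg)


# Usage: the certificate is epsilon-accurate when the returned value is <= epsilon.
gap = certificate_gap_on_box([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [-1.0, 0.0]],
                             [0.5, 0.5])
print(gap)
```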



3.3.2 Efficiency estimate 

Proposition 3.3 Given on input a target tolerance $\epsilon > 0$, the NERML algorithm terminates
after finitely many steps, with execution protocol $y^t$ and accuracy certificate $\lambda^t$, described in
Remark 3.4. The number of steps of the algorithm does not exceed

$$N = C(\gamma, \theta)\,\frac{\Omega^2 L^2[g]}{\epsilon^2}, \quad \text{where } C(\gamma, \theta) = \frac{1 + \gamma^2}{\gamma^2[1 - [\gamma + (1-\gamma)\theta]^2]}. \qquad (37)$$

For proof, see Section A.3.
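
As a quick numerical illustration of how the step bound depends on the control parameters, the following sketch evaluates it under the assumption that the constant in (37) reads $C(\gamma,\theta) = (1+\gamma^2)/(\gamma^2[1-(\gamma+(1-\gamma)\theta)^2])$; the function name and argument names are ours.

```python
import math


def nerml_step_bound(omega, lip, eps, gamma, theta):
    """Evaluate ceil(C(gamma, theta) * omega^2 * lip^2 / eps^2).

    omega -- the quantity Omega of the proximal setup
    lip   -- the bound L[g] on the dual norm of the vector field
    eps   -- target tolerance
    The form of C(gamma, theta) is our reading of (37), stated as an assumption.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < theta < 1.0):
        raise ValueError("gamma and theta must lie in (0, 1)")
    rate = gamma + (1.0 - gamma) * theta          # per-phase contraction factor of f_s
    c = (1.0 + gamma ** 2) / (gamma ** 2 * (1.0 - rate ** 2))
    return math.ceil(c * omega ** 2 * lip ** 2 / eps ** 2)


# Usage: Omega = L[g] = 1, eps = 0.1, gamma = theta = 0.5.
print(nerml_step_bound(1.0, 1.0, 0.1, 0.5, 0.5))
```

The bound blows up both when $\gamma \to 0$ (levels too low) and when $\gamma + (1-\gamma)\theta \to 1$ (phases shrink $f_s$ too slowly), which is why both parameters are usually kept away from the endpoints of (0, 1).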



Remark 3.5 Inspecting the proof of Proposition 3.3, it is immediately seen that when $\omega(\cdot)$ is
continuously differentiable on the entire Y, one can replace the rule (32) with

$$u_{\tau+1} = \mathrm{argmin}_{y\in Y_\tau}\ [\omega(y) - \langle \omega'(y^s), y - y^s\rangle],$$

where $y^s$ is an arbitrary point of $Y^o = Y$. The cost of this modification is that of replacing $\Omega$ in
the efficiency estimate with the quantity specified in Remark 3.1. Computational experience shows that a good
choice of $y^s$ is the best, in terms of the objective, search point generated before the beginning
of phase s.



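Solving (P) and (D) by NERML is completely similar to the case of MDL, with the bound

$$t(\epsilon) \le \mathrm{Ceil}\left(\frac{\Omega^2 L_f^2 (1 + \gamma^2)}{\gamma^2[1 - [\gamma + (1-\gamma)\theta]^2]\,\epsilon^2}\right) \quad [\Omega = \Omega[Y, \omega(\cdot)]] \qquad (38)$$

in the role of (30).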



Remark 3.6 Observe that Propositions 3.1 - 3.3 do not impose restrictions on the vector field $g(\cdot)$
processed by the respective algorithms aside from the boundedness assumption (15). Invoking
Remark 2.1 we arrive at the following conclusion:

in the situation of section 2.1 and given $\delta > 0$, let instead of exact maximizers $x(y) \in
\mathrm{Argmax}_{x\in X}\langle x, Ay + a\rangle$, approximate maximizers $x_\delta(y) \in X$ such that $\langle x_\delta(y), Ay + a\rangle \ge \langle x(y), Ay +
a\rangle - \delta$ for all $y \in Y$ be available. Let also

$$f'_\delta(y) = A^* x_\delta(y) + \psi'(y)$$

be the associated approximate subgradients of the objective f of (D). Assuming

$$L_{f,\delta} = \sup_{y\in Y} \|f'_\delta(y)\|_{y,*} < \infty,$$

let MD/MDL/NERML be applied to the vector field $g(\cdot) = f'_\delta(\cdot)$. Then the number of steps of
each method before termination remains bounded by the respective bound from Propositions 3.1 - 3.3,
with $L_{f,\delta}$ in the role of $L_f$. Besides this, defining the approximate solutions to (P), (D) according
to (20), with $x_\tau = x_\delta(y_\tau)$, we ensure the validity of the $\delta$-relaxed version of the accuracy guarantee
(23), specifically, the relation

$$[f(y^t) - \mathrm{Opt}(P)] + [\mathrm{Opt}(D) - f_*(x^t)] \le \epsilon + \delta.$$



4 An alternative: Smoothing 

An alternative to the approach we have presented so far is based on the use of the proximal
setup for Y to smoothen $f_*$ and then to maximize the resulting smooth approximation of $f_*$ by
the Conditional Gradient (CG) algorithm. This approach is completely similar to the one used
by Nesterov in his breakthrough paper [11], with the only difference that since in our situation
the domain X admits an LO oracle rather than a good proximal setup, we are bound to replace the
$O(1/t^2)$-converging Nesterov's method for smooth convex minimization with the $O(1/t)$-converging
CG.

Let us describe the CG implementation in our setting. Suppose that we are given a norm
$\|\cdot\|_x$ on $E_x$, a representation (2) of $f_*$, a proximal point setup $(\|\cdot\|_y, \omega(\cdot))$ for Y and a desired
tolerance $\epsilon > 0$. We assume w.l.o.g. that $\min_{y\in Y}\omega(y) = 0$ and set, following Nesterov [11],

$$f_*^\beta(x) = \min_{y\in Y}\ [\langle x, Ay + a\rangle + \psi(y) + \beta\omega(y)], \qquad \beta = \beta(\epsilon) := \frac{\epsilon}{\Omega^2}, \quad \Omega = \Omega[Y, \omega(\cdot)] \qquad (39)$$

From (2), the definition of $\Omega$ and the relation $\min_Y \omega = 0$ it immediately follows that

$$\forall x \in X : f_*(x) \le f_*^\beta(x) \le f_*(x) + \frac{\epsilon}{2}, \qquad (40)$$

and $f_*^\beta$ clearly is concave. It is well known that strong convexity, modulus 1 w.r.t. $\|\cdot\|_y$, of
$\omega(y)$ implies smoothness of $f_*^\beta$, specifically,

$$\forall (x, x' \in X) : \|\nabla f_*^\beta(x) - \nabla f_*^\beta(x')\|_{x,*} \le \frac{1}{\beta}\|A\|^2_{y;x,*}\|x - x'\|_x, \qquad (41)$$

where

$$\|A\|_{y;x,*} = \max\{\|A^*u\|_{y,*} : u \in E_x, \|u\|_x \le 1\} = \max\{\|Ay\|_{x,*} : y \in E_y, \|y\|_y \le 1\}.$$

Observe also that under the assumption that an optimal solution $y(x)$ of the right hand side
minimization problem in (39) is available at a moderate computational cost, we have at our
disposal a FO oracle for $f_*^\beta$:

$$f_*^\beta(x) = \langle x, Ay(x) + a\rangle + \psi(y(x)) + \beta\omega(y(x)), \qquad \nabla f_*^\beta(x) = Ay(x) + a.$$

We can now use this oracle, along with the LO oracle for X, to solve (P) by CG. In the sequel,
we refer to the outlined algorithm as to SCG (Smoothed Conditional Gradient).
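
A one-dimensional toy instance may help to see what the smoothing does. The sketch below is our own illustration, under the following assumptions: $E_x = E_y = \mathbf{R}$, $Y = [-1, 1]$, $A = 1$, $a = 0$, $\psi \equiv 0$, $\omega(y) = y^2/2$, and $\Omega^2 = 2[\max_Y\omega - \min_Y\omega] = 1$. Then $f_*(x) = \min_{|y|\le1} xy = -|x|$, the minimizer $y(x)$ in the smoothed problem is a clipped linear function of x, and the sandwich relation $f_* \le f_*^\beta \le f_* + \epsilon/2$ can be checked directly.

```python
def smoothed_oracle(x, eps):
    """First-order oracle for the smoothed function in the toy 1-D instance.

    Assumptions (ours): Y = [-1, 1], A = 1, a = 0, psi = 0, omega(y) = y^2 / 2,
    so that f_*(x) = -|x| and Omega^2 = 1.
    Returns (value, gradient) of the smoothed function at x.
    """
    omega_sq = 1.0
    beta = eps / omega_sq                       # beta = eps / Omega^2
    y = max(-1.0, min(1.0, -x / beta))          # minimizer of x*y + beta*y^2/2 over [-1, 1]
    value = x * y + beta * 0.5 * y * y
    grad = y                                    # gradient is A*y(x) + a = y(x)
    return value, grad


# Usage: compare the smoothed value with f_*(x) = -|x|.
eps = 0.1
for x in (-2.0, -0.05, 0.0, 0.03, 1.5):
    value, grad = smoothed_oracle(x, eps)
    print(x, -abs(x), value, grad)
```

In this instance the smoothed function is the negative of a Huber-type function: quadratic for $|x| \le \beta$ and equal to $-|x| + \beta/2$ elsewhere.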



Footnote (on the smoothness claim (41)): To make the paper self-contained, we provide verification in Appendix.

Footnote (on computing $y(x)$): In typical applications, $\psi$ is just linear, so that computing $y(x)$ is as easy as computing the value of the
prox-mapping associated with Y, $\omega(\cdot)$.
Efficiency estimate for SCG is readily given by Proposition 2.1. Indeed, assume from now
on that X is contained in the $\|\cdot\|_x$-ball of radius R of $E_x$. It is immediately seen that under this
assumption, (41) implies the validity of the condition (cf. (3) with q = 2)

$$\forall x, x' \in X : f_*^\beta(x') \ge f_*^\beta(x) + \langle \nabla f_*^\beta(x), x' - x\rangle - \frac{1}{2}\mathcal{L}\|x' - x\|_x^2, \qquad \mathcal{L} = \frac{\|A\|^2_{y;x,*}}{\beta} = \frac{\Omega^2\|A\|^2_{y;x,*}}{\epsilon}. \qquad (42)$$

In order to find an $\epsilon$-maximizer of $f_*$ it suffices, by (40), to find an $\epsilon/2$-maximizer of $f_*^\beta$; by (5)
(where one should set q = 2), this takes

$$t_{SCG}(\epsilon) \le O(1)\,\epsilon^{-1}\mathcal{L}R^2 = O(1)\,\frac{\Omega^2\|A\|^2_{y;x,*}R^2}{\epsilon^2} \qquad (43)$$

steps.



Discussion. Let us assume, as above, that X is contained in the centered at the origin $\|\cdot\|_x$-ball
of radius R, and let us compare the complexity bounds (22), (30), (38), which are essentially identical
to each other (provided the parameters $\gamma \in (0, 1)$, $\theta \in (0, 1)$ in (30), (38) are treated as absolute
constants), with the bound (43). Under the natural assumption that the subgradients of $\psi$ we
use satisfy the bounds $\|\psi'(y)\|_{y,*} \le L_\psi$, where $L_\psi$ is the Lipschitz constant of $\psi$ w.r.t. the norm
$\|\cdot\|_y$, the relation $f'(y) = A^*x(y) + \psi'(y)$ implies that

$$L_f \le \|A\|_{y;x,*}R + L_\psi. \qquad (44)$$
Thus, the first three complexity bounds reduce to

$$\mathrm{Ceil}\left(O(1)\,\frac{[R\|A\|_{y;x,*} + L_\psi]^2\,\Omega^2[Y, \omega(\cdot)]}{\epsilon^2}\right), \qquad (45)$$

while the conditional gradients based complexity bound (43) is $O(1)\,R^2\|A\|^2_{y;x,*}\Omega^2[Y, \omega(\cdot)]/\epsilon^2$.

We see that assuming $L_\psi \le O(1)R\|A\|_{y;x,*}$ (which indeed is the case in many applications, in
particular, in the examples we are about to consider), the complexity bounds in question are
essentially identical. This being said, we believe that the two approaches in question
have their own advantages and disadvantages. Let us name just a few:

• Formally, SCG has a more restricted area of applications than MD/MDL/NERML,
since relative simplicity of the optimization problem in (39) is a more restrictive requirement
than relative simplicity of computing the prox-mapping associated with Y, $\omega(\cdot)$. At the
same time, in most important applications known to us $\psi$ is just linear, and in this case
the just outlined phenomenon disappears.

• An argument in favor of SCG is its insensitivity to the Lipschitz constant of $\psi$. Note,
however, that in the case of linear $\psi$ (which, as we have mentioned, is the case of primary
interest) the nonsmooth techniques admit simple modifications (not to be considered here)
which make them equally insensitive to $L_\psi$.

• Our experience shows that the convergence pattern of nonsmooth methods utilizing memory
(MDL and NERML) is, at least at the beginning of the solution process, much better
than is predicted by their worst-case efficiency estimates. It should be added that in theory
there exist situations where the nonsmooth approach "most probably," or even provably,
significantly outperforms the smooth one. This is the case when $E_y$ is of moderate dimension.
A well-established experimental fact is that when solving (D) by MDL, every $\dim E_y$
iterations of the method reduce the inaccuracy by an absolute constant factor, something
like 3. It follows that if $n = \dim E_y$ is in the range of a few hundreds, a couple of thousands of MDL
steps can yield a solution of accuracy which is incomparably better than the one predicted
by the theoretical worst-case oriented $O(1/\epsilon^2)$ complexity bound of the algorithm. Moreover,
in principle one can solve (D) by the Ellipsoid method with certificates [10], building
an accuracy certificate of resolution $\epsilon$ in polynomial time $O(1)n^2\ln(L_f\Omega[Y, \omega(\cdot)]/\epsilon)$. It follows
that when $\dim E_y$ is in the range of a few tens, the nonsmooth approach allows to solve, in
moderate time, problems (P) and (D) to high accuracy. Note that low dimensionality of
$E_y$ by itself does not prevent X from being high-dimensional and "difficult;" how frequent
these situations are in actual applications is another story.

We believe that the choice of one, if any, of the outlined approaches to use, is the issue which 
should be resolved, on the case-by-case basis, by computational practice. We believe, however, 
that it makes sense to keep them both in mind. 



5 Application examples 

In this section we work out some application examples, with the goal to demonstrate that the 
approach we are proposing possesses certain application potential. 



5.1 Uniform norm matrix completion 

Our first example (for its statistical motivation, see [B]) is as follows: given a symmetric p x p 
matrix b and a positive real R, we want to find the best entrywise approximation of b by a
positive semidefinite matrix x of given trace R, that is, to solve the problem

$$\min_{x\in X}\ [-f_*(x) = \|x - b\|_\infty], \qquad X = \{x \in S^p : x \succeq 0,\ \mathrm{Tr}(x) = R\}, \qquad \|x\|_\infty = \max_{1\le i,j\le p}|x_{ij}| \qquad (46)$$

where $S^p$ is the space of p × p symmetric matrices. Note that with our X, computing prox-mappings
associated with all known proximal setups needs eigenvalue decomposition of a p × p
symmetric matrix and thus becomes computationally demanding in the large scale case. On
the other hand, to maximize a linear form $\langle \xi, x\rangle = \mathrm{Tr}(\xi x)$ over $x \in X$ requires computing the
maximal eigenvalue of $\xi$ along with the corresponding eigenvector. In the large scale case this task
is by orders of magnitude less demanding than computing the full eigenvalue decomposition. Note
that our $f_*$ admits a simple Fenchel-type representation:

$$f_*(x) = -\|x - b\|_\infty = \min_{y\in Y}\ [\langle -x, y\rangle + \langle b, y\rangle], \qquad Y = \{y \in S^p : \|y\|_1 := \sum_{i,j}|y_{ij}| \le 1\}.$$

Equipping $E_y = S^p$ with the norm $\|\cdot\|_y = \|\cdot\|_1$ and Y with the d.-g.f.

$$\omega(y) = \alpha\ln(p)\sum_{i,j=1}^p |y_{ij}|^{1+r(p)}, \qquad r(p) = \frac{1}{\ln(p)},$$

where $\alpha$ is an appropriately chosen constant of order of 1 (induced by the necessity to make $\omega(\cdot)$
strongly convex, modulus 1, w.r.t. $\|\cdot\|_1$), we get a proximal setup for Y such that

$$\Omega[Y, \omega(\cdot)] \le O(1)\sqrt{\ln p}.$$
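
The LO oracle for $X = \{x \succeq 0, \mathrm{Tr}(x) = R\}$ discussed above amounts to one leading-eigenvector computation: a maximizer of $\mathrm{Tr}(\xi x)$ over X is $R\,ee^T$ with e a unit leading eigenvector of $\xi$. The following is a minimal sketch of this oracle, assuming a dense symmetric matrix stored as nested lists and a plain power iteration with a Gershgorin-type shift; both choices, as well as the function names, are ours and serve only as an illustration (in the large scale case one would rather rely on a Lanczos-type routine and sparse storage).

```python
import math


def leading_eigenvector(xi, iters=500):
    """Approximate a unit leading eigenvector of the symmetric matrix xi.

    Power iteration is applied to xi + shift*I, where shift is the largest
    absolute row sum; by Gershgorin's theorem the shifted matrix is positive
    semidefinite, so its dominant eigenvector is a leading eigenvector of xi.
    The all-ones start is assumed not to be orthogonal to that eigenvector.
    """
    n = len(xi)
    shift = max(sum(abs(v) for v in row) for row in xi)
    e = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(xi[i][j] * e[j] for j in range(n)) + shift * e[i] for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:                       # xi is the zero matrix: any unit vector will do
            break
        e = [v / norm for v in w]
    return e


def lo_oracle_spectrahedron(xi, R, iters=500):
    """Return R * e e^T, an (approximate) maximizer of Tr(xi x) over {x PSD, Tr(x) = R}."""
    e = leading_eigenvector(xi, iters)
    return [[R * a * b for b in e] for a in e]


# Usage: for xi = diag(2, 1) the maximizer puts the whole trace on the first coordinate.
x = lo_oracle_spectrahedron([[2.0, 0.0], [0.0, 1.0]], 3.0)
print(x)
```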
We see that our problem of interest fits well the setup of methods developed in this paper.
Invoking the complexity bounds from the Discussion in section 4, we conclude that (46) can be solved within accuracy $\epsilon$ in at
most

$$t(\epsilon) = O(1)\,\frac{[R + \|b\|_\infty]^2\ln(p)}{\epsilon^2}$$

steps by any of the methods MD, MDL or NERML, and in at most $O(1)\,R^2\ln(p)/\epsilon^2$ steps by SCG.

It is worth mentioning that in the case in question, the algorithms yielded by the nonsmooth
approach admit a "sparsification" as follows. We are in the case of $\langle x, Ay + a\rangle = \mathrm{Tr}(xy)$, and
$X = \{x : x \succeq 0,\ \mathrm{Tr}(x) = R\}$, so that $x(y) = Re_ye_y^T$, where $e_y$ is the leading eigenvector of the
matrix y normalized to have $\|e_y\|_2 = 1$. Given a desired accuracy $\epsilon > 0$ and a unit vector $e_{y,\epsilon}$
such that $\mathrm{Tr}(y[e_{y,\epsilon}e_{y,\epsilon}^T]) \ge \mathrm{Tr}(y[e_ye_y^T]) - R^{-1}\epsilon$, and setting $x_\epsilon(y) = Re_{y,\epsilon}e_{y,\epsilon}^T$, we ensure that
$x_\epsilon(y) \in X$ and that $x_\epsilon(y)$ is an $\epsilon$-maximizer of $\langle x, Ay + a\rangle$ over $x \in X$. Invoking Remark 3.6,
we conclude that when utilizing $x_\epsilon(\cdot)$ in the role of $x(\cdot)$, we get $2\epsilon$-accurate solutions to (P),
(D) in no more than $t(\epsilon)$ steps. Now, we can take as $e_{y,\epsilon}$ the normalized leading eigenvector of
an arbitrary matrix $\hat y(y)$ such that $\|\sigma(\hat y - y)\|_\infty \le R^{-1}\epsilon$. Assuming $R/\epsilon > 1$ and given $y \in Y$, let
us sort the magnitudes of entries in y and build $y_\epsilon$ by "thresholding" - by zeroing out as many
smallest in magnitude entries as is possible under the restriction that the remaining part of the
matrix y is symmetric, and the sum of squares of the entries we have replaced with zeros does
not exceed $R^{-2}\epsilon^2$. Since $\|y\|_1 \le 1$, the number $N_\epsilon$ of nonzero entries in $y_\epsilon$ is at most $O(1)R^2/\epsilon^2$.
On the other hand, by construction, the Frobenius norm $\|\sigma(y - y_\epsilon)\|_2$ of $y - y_\epsilon$ is $\le R^{-1}\epsilon$, thus
$\|\sigma(y - y_\epsilon)\|_\infty \le R^{-1}\epsilon$, and we can take as $e_{y,\epsilon}$ the normalized leading eigenvector of $y_\epsilon$. When
the size p of y is $\gg R^2/\epsilon^2$ (otherwise the outlined sparsification does not make sense), this
approach reduces the problem of computing the leading eigenvector to the case when the matrix
in question is relatively sparse, thus reducing its computational cost.
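
The "thresholding" step just described can be sketched as follows. This is an illustrative implementation under our own conventions: the matrix is a dense nested list, the allowed sum of squares is passed in as the parameter `budget` (to be set from R and $\epsilon$ as in the text), and entries are removed in symmetric pairs in order of increasing magnitude until the next removal would exceed the budget.

```python
def threshold_symmetric(y, budget):
    """Zero out the smallest-magnitude entries of a symmetric matrix y.

    Entries (i, j) and (j, i) are removed together, so the result stays
    symmetric; the total sum of squares of removed entries never exceeds budget.
    Returns (thresholded matrix, removed sum of squares).
    """
    n = len(y)
    cells = []
    for i in range(n):
        for j in range(i, n):
            mult = 1 if i == j else 2            # an off-diagonal pair counts twice
            cells.append((abs(y[i][j]), mult * y[i][j] ** 2, i, j))
    cells.sort()                                 # smallest magnitudes first
    out = [row[:] for row in y]
    removed = 0.0
    for _, cost, i, j in cells:
        if removed + cost > budget:
            break                                # keep this entry and all larger ones
        removed += cost
        out[i][j] = 0.0
        out[j][i] = 0.0
    return out, removed


# Usage: only the small off-diagonal pair fits into the budget.
y_eps, removed = threshold_symmetric([[0.5, 0.1], [0.1, 0.3]], 0.03)
print(y_eps, removed)
```

The leading eigenvector of the returned sparse matrix can then be computed by any sparse eigensolver in place of the dense computation for y.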

5.2 Nuclear norm SVM 

Our next example is as follows: we are given an N-element sample of p × q matrices $z_j$ ("images")
equipped with labels $\epsilon_j \in \{-1, 1\}$. We assume the images to be normalized by the restriction

$$\|\sigma(z_j)\|_\infty \le 1. \qquad (49)$$

We want to find a linear classifier of the form

$$\epsilon_j = \mathrm{sign}(\langle x, z_j\rangle + b) \qquad [\langle z, x\rangle = \mathrm{Tr}(zx^T)]$$

which predicts well labels of images. The "low-rank-oriented" SVM-based setting of this problem
is

$$\min_{x:\|\sigma(x)\|_1\le R}\ \left[h(x) := \min_{b\in\mathbf{R}}\ N^{-1}\sum_{j=1}^N [1 - \epsilon_j[\langle x, z_j\rangle + b]]_+\right] \qquad (50)$$

where $\|\sigma(\cdot)\|_1$ is the nuclear norm, $[a]_+ = \max[a, 0]$, and $R \ge 1$ is a parameter.

Footnote (on the restriction $R \ge 1$): The restriction $R \ge 1$ is quite natural. Indeed, with the optimal choice of x, we want most of the terms
$[1 - \epsilon_j[\langle x, z_j\rangle + b]]_+$ to be $\ll 1$; assuming that the numbers of examples with $\epsilon_j = -1$ and $\epsilon_j = 1$ are both of order of
N, this condition can be met only when $|\langle x, z_j\rangle|$ are at least of order of 1 for most of j's. The latter, in view of
(49), implies that $\|\sigma(x)\|_1$ should be at least O(1).

In this case the domain X of problem (P) is the ball of the nuclear norm in the space $\mathbf{R}^{p\times q}$
of p × q matrices and p, q are large. As we have explained in the introduction, same as in the
example of the previous section, in this case the computational complexity of the LO oracle is much
smaller than the complexity of computing the prox-mapping. Thus, from the practical viewpoint, in a
meaningful range of values of p, q the LO oracle is "affordable," while the prox-mapping is not.
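Observing that $[a]_+ = \max_{0\le y\le 1} ya$, and denoting $\mathbf{1} = [1; \ldots; 1] \in \mathbf{R}^N$, we get

$$N^{-1}\sum_{j=1}^N [1 - \epsilon_j[\langle x, z_j\rangle + b]]_+ = \max_{y:\,0\le y\le\mathbf{1}}\left\{N^{-1}\sum_{j=1}^N y_j[1 - \epsilon_j[\langle x, z_j\rangle + b]]\right\},$$

whence

$$h(x) = \max_{y\in Y}\ N^{-1}\sum_{j=1}^N y_j[1 - \epsilon_j\langle x, z_j\rangle], \qquad (51)$$

where

$$Y = \{y \in \mathbf{R}^N : 0 \le y \le \mathbf{1},\ \sum_j \epsilon_jy_j = 0\}; \qquad (52)$$

from now on we assume that $Y \ne \{0\}$. When setting

$$Ay = N^{-1}\sum_{j=1}^N y_j\epsilon_jz_j : \mathbf{R}^N \to \mathbf{R}^{p\times q}, \qquad X = \{x \in \mathbf{R}^{p\times q} : \|\sigma(x)\|_1 \le R\}, \qquad \psi(y) = -N^{-1}\mathbf{1}^Ty \qquad (53)$$
and passing from minimizing $h(x)$ to maximizing $f_*(x) = -h(x)$, problem (50) becomes

$$\max_{x\in X}\ \left[f_*(x) := \min_{y\in Y}\ [\langle x, Ay\rangle + \psi(y)]\right] \qquad (54)$$

Let us equip $E_y = \mathbf{R}^N$ with the standard Euclidean norm $\|\cdot\|_2$, and Y - with the Euclidean
d.-g.f. $\omega(y) = \frac{1}{2}y^Ty$. Observe that

$$[f'(y)]_j = N^{-1}[\langle x(y), \epsilon_jz_j\rangle - 1], \qquad x(y) \in \mathrm{Argmax}_{x\in X}\ \langle x, Ay\rangle,$$

meaning that

$$\|f'(y)\|_\infty \le \max_j N^{-1}[1 + \|\sigma(x(y))\|_1\|\sigma(z_j)\|_\infty] \le N^{-1}[R + 1] \le 2N^{-1}R.$$

Using our notation of section 3 we have

$$L_f := \sup_{y\in Y}\|f'(y)\|_{y,*} \le 2N^{-1/2}R$$

(we are in the case of $\|\cdot\|_{y,*} = \|\cdot\|_2$), and, besides,

$$\Omega_Y \le \sqrt{N/2}.$$

We conclude that for every $\epsilon > 0$, the number t of MD steps needed to ensure (23) does not
exceed

$$t_{MD}(\epsilon) = \mathrm{Ceil}\left(\frac{2R^2}{\epsilon^2}\right)$$
(see (22)), and similarly for MDL, NERML, and SCG.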
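
For reference, the empirical hinge-type objective entering (50) is straightforward to evaluate for a fixed shift b. The sketch below is our own illustration (hypothetical function names, images stored as nested lists); it uses the inner product $\langle z, x\rangle = \mathrm{Tr}(zx^T)$, that is, the sum of entrywise products, and does not perform the minimization over b.

```python
def frobenius_inner(x, z):
    """<z, x> = Tr(z x^T): the sum of entrywise products of two p x q matrices."""
    return sum(a * b for row_x, row_z in zip(x, z) for a, b in zip(row_x, row_z))


def empirical_hinge(x, b, images, labels):
    """N^{-1} * sum_j [1 - eps_j * (<x, z_j> + b)]_+ for a fixed shift b."""
    n = len(images)
    total = 0.0
    for z, eps in zip(images, labels):
        total += max(0.0, 1.0 - eps * (frobenius_inner(x, z) + b))
    return total / n


# Usage: two 2 x 2 "images" with opposite labels, separated with margin 2 by x = I.
images = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
labels = [1, -1]
print(empirical_hinge([[1.0, 0.0], [0.0, 1.0]], 0.0, images, labels))
```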
5.3 Multi-class classification under $\infty|2$ norm constraint

Our last example illustrates the potential of the proposed approach in the case when the domain
X of (P) does not admit a proximal setup with "moderate" $\Omega_X$. Namely, let (P) be the problem

$$\min_{x\in X}\ [-f_*(x) := \|Bx - b\|_{y,*}] \qquad [x \mapsto Bx : E_x \to E_y] \qquad (55)$$

where $\|\cdot\|_{y,*}$ is the norm conjugate to a norm $\|\cdot\|_y$ on $E_y$. We are interested in the case of
box-type X, specifically,

$$X = \{[x^1; \ldots; x^M] \in E_x = \mathbf{R}^{n_1} \times \ldots \times \mathbf{R}^{n_M} : \|x^i\|_2 \le R,\ 1 \le i \le M\} \qquad (56)$$

As it was mentioned in Introduction, for every proximal setup $(\|\cdot\|, \omega_X(\cdot))$ for X which is normalized
by the requirement that simple "well behaved" on X convex functions should have
moderate Lipschitz constants w.r.t. $\|\cdot\|$ (specifically, the coordinates of $x \in X$ should have Lipschitz
constants $\le 1$), one has $\Omega[X, \omega_X(\cdot)] \ge O(1)\sqrt{M}R$. As a result, the theoretical complexity
of the FO methods as applied to (55) grows with M at the rate at least $O(\sqrt{M})$, thus becoming
prohibitively high for large M. We are about to show that the approaches developed in this
paper are free of this shortcoming. Specifically, we can easily build a Fenchel-type representation
of $f_*$:

$$f_*(x) = -\|Bx - b\|_{y,*} = \min_{y\in Y}\ [\langle B^*y, x\rangle - \langle b, y\rangle], \qquad Y = \{y \in E_y : \|y\|_y \le 1\}.$$

Assume that Y admits a good proximal setup. We can augment $\|\cdot\|_y$ with a d.-g.f. $\omega_Y(\cdot)$ for Y
such that $\|\cdot\|_y$, $\omega_Y(\cdot)$ form a proximal setup; applying any of the methods we have developed
in sections 3 and 4, the complexity of finding an $\epsilon$-solution to (55) by any of these methods becomes

$$O(1)\left(\frac{[R\|B\|_{x;y,*} + \|b\|_{y,*}]\,\Omega[Y, \omega_Y(\cdot)]}{\epsilon}\right)^2, \qquad \|B\|_{x;y,*} := \max_{x=[x^1;\ldots;x^M]}\left\{\|Bx\|_{y,*} : \|x\|_{\infty|2} := \max_{1\le i\le M}\|x^i\|_2 \le 1\right\}.$$

Note that in this bound M does not appear, at least explicitly.



The multi-class classification problem we consider is as follows: we observe "feature vectors"
$z_j \in \mathbf{R}^q$, each belonging to one of M non-overlapping classes, along with labels $\chi_j \in \mathbf{R}^M$ which
are basic orths in $\mathbf{R}^M$; the index of the (only) nonzero entry in $\chi_j$ is the number of the class to
which $z_j$ belongs. We want to build a multi-class analogy of the standard linear classifier as
follows: a multi-class classifier is specified by a matrix $x \in \mathbf{R}^{M\times q}$ and a vector $b \in \mathbf{R}^M$. Given a
feature vector z, we compute the M-dimensional vector $xz + b$, identify its maximal component,
and treat the index of this component as our guess for the serial number of the class to which
z belongs.

The multi-class analogy of the usual approach to building binary classifiers by minimizing
the empirical hinge loss is as follows [?]. Let $\bar\chi_j = \mathbf{1} - \chi_j$ be the "complement" of $\chi_j$. Given a
feature vector z and the corresponding label $\chi$, let us set

$$h = h(x, b; z, \chi) = [xz + b] - [\chi^T(xz + b)]\mathbf{1} + \bar\chi \in \mathbf{R}^M \qquad [\mathbf{1} = [1; \ldots; 1] \in \mathbf{R}^M].$$

Note that if $i_*$ is the index of the only nonzero entry in $\chi$, then the $i_*$-th entry in h is zero (since
$\chi_{i_*} = 1$). Further, h is nonpositive if and only if the classifier, given by x, b and evaluated at
z, "recovers the class of z with margin 1", i.e., we have $[xz + b]_j \le [xz + b]_{i_*} - 1$ for $j \ne i_*$.
On the other hand, if the classifier fails to classify z correctly (that is, $[xz + b]_j \ge [xz + b]_{i_*}$ for
some $j \ne i_*$), then the maximal entry in h is $\ge 1$. Altogether, when setting

$$\eta(x, b; z, \chi) = \max_{1\le j\le M}\ [h(x, b; z, \chi)]_j,$$

we get a nonnegative function which vanishes for the pairs $(z, \chi)$ which are "quite reliably"
- with margin $\ge 1$ - classified by $(x, b)$, and is $\ge 1$ for the pairs $(z, \chi)$ with z not classified
correctly. Thus the function

$$H(x, b) = \mathbf{E}\{\eta(x, b; z, \chi)\},$$

the expectation being taken over the distribution of examples $(z, \chi)$, is an upper bound on the
probability for the classifier $(x, b)$ to misclassify a feature vector. What we would like to do now is to
minimize $H(x, b)$ over x, b. To do this, since $H(\cdot)$ is not observable, we replace the expectation
by its empirical counterpart

$$H_N(x, b) = N^{-1}\sum_{j=1}^N \eta(x, b; z_j, \chi_j).$$
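
The vector h and the loss $\eta$ are easy to compute directly from their definitions. The sketch below is an illustration with hypothetical names; the label is passed as the class index $i_*$ rather than as the basic orth $\chi$, which is an equivalent encoding chosen for convenience.

```python
def multiclass_hinge(x, b, z, label):
    """Compute h(x, b; z, chi) and eta = max_j h_j for one example.

    x     -- M x q matrix given as a list of M rows
    b     -- list of M shifts
    z     -- feature vector of length q
    label -- index i_* of the class of z (the nonzero entry of chi)
    """
    scores = [sum(xi * zi for xi, zi in zip(row, z)) + bi for row, bi in zip(x, b)]
    # h_i = score_i - score_{i_*} + (1 - chi_i); the i_*-th entry is zero by construction
    h = [s - scores[label] + (0.0 if i == label else 1.0) for i, s in enumerate(scores)]
    return h, max(h)


# Usage: two classes, identity classifier, feature vector z = (2, 0).
x = [[1.0, 0.0], [0.0, 1.0]]
print(multiclass_hinge(x, [0.0, 0.0], [2.0, 0.0], 0))   # classified with margin 2
print(multiclass_hinge(x, [0.0, 0.0], [2.0, 0.0], 1))   # misclassified
```

Averaging the second returned value over the sample gives the empirical loss $H_N(x, b)$.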

For the sake of simplicity (and, upon a close inspection, without much harm), we assume from
now on that b = 0. Imposing, as it is always the case in hinge loss optimization, an upper
bound on some norm $\|x\|_x$ of x, we arrive at the optimization problem

$$\min_{x\in X}\ N^{-1}\sum_{j=1}^N \max_{i\le M}\ [h(x, 0; z_j, \chi_j)]_i, \qquad X = \{x : \|x\|_x \le R\}. \qquad (57)$$

Footnote (on the assumption b = 0): To arrive at this situation, one can augment $z_j$ by an additional entry, equal to 1, and redefine x: the new x
is the old $[x, b]$.

From now on we assume that $z_j$'s are normalized:

$$\|z_j\|_2 \le 1, \quad 1 \le j \le N. \qquad (58)$$

Under this constraint, a natural (although not the only meaningful) choice of the norm $\|\cdot\|_x$
is the maximum of the $\|\cdot\|_2$-norms of the rows $[x^i]^T$ of x. If we identify x with the vector
$[x^1; \ldots; x^M]$, X becomes the set (56) with $n_1 = n_2 = \ldots = n_M = q$, and the norm $\|\cdot\|_x$ becomes
$\|\cdot\|_{\infty|2}$. The same argument as in the previous section allows us to assume that $R \ge 1$.
Noting that $\max_j h_j = \max_u\{u^Th : u \ge 0, \sum_i u_i = 1\}$, (57) can be rewritten as

$$\max_{x\in X}\ \left[f_*(x) = \min_{y\in Y}\ [\langle y, Bx\rangle + \psi(y)]\right] \qquad (59)$$

where

$$Y = \{y = [y^1; \ldots; y^N] : y^j \in \mathbf{R}^M_+,\ \sum_i [y^j]_i = N^{-1},\ 1 \le j \le N\} \subset E_y = \mathbf{R}^{MN},$$
$$Bx = [B^1x; \ldots; B^Nx], \qquad B^jx = \left[z_j^T[x^{i(j)} - x^1]; \ldots; z_j^T[x^{i(j)} - x^M]\right], \quad j = 1, \ldots, N,$$
$$\psi(y) = \psi(y^1, \ldots, y^N) = -\sum_{j=1}^N [y^j]^T\bar\chi_j$$

(here $i(j)$ is the class of $z_j$, i.e., the index of the only nonzero entry in $\chi_j$). Note that Y is a part
of the standard simplex $\Delta_{MN} = \{y \in \mathbf{R}^{MN}_+ : \sum_{j=1}^N\sum_{i=1}^M [y^j]_i = 1\} \subset E_y = \mathbf{R}^{MN}$. Equipping
$E_y$ with the norm $\|\cdot\|_y = \|\cdot\|_1$ (so that $\|\cdot\|_{y,*} = \|\cdot\|_\infty$), and Y - with the entropy d.-g.f.

$$\omega_Y(y) = \sum_{j=1}^N\sum_{i=1}^M [y^j]_i\ln([y^j]_i)$$

(known to complete $\|\cdot\|_1$ to a proximal setup for $\Delta_{MN}$), we get a proximal setup for Y with
$\Omega = \Omega[Y, \omega(\cdot)] \le \sqrt{2\ln(M)}$. Next, assuming $\|x\|_x = \|x\|_{\infty|2} \le 1$, we have

$$\|Bx\|_{y,*} = \|Bx\|_\infty = \max_{1\le i\le M,\ 1\le j\le N} |z_j^T[x^{i(j)} - x^i]| \le \max_{i,j}\ \|z_j\|_2\|x^{i(j)} - x^i\|_2 \le 2,$$

so that $\|B\|_{x;y,*} \le 2$. Furthermore, $\psi$ clearly is Lipschitz continuous with constant 1 w.r.t.
$\|\cdot\|_y = \|\cdot\|_1$. It follows that the complexity of finding an $\epsilon$-solution to (55) by MD, MDL,
NERML or SCG is bounded by $O(1)\,R^2\ln(M)/\epsilon^2$ (see the complexity bounds from the Discussion in section 4 and take into account that
$R \ge 1$ and that what is now called B, was called $A^*$ in the notation used in those bounds, so
that $\|B\|_{x;y,*} = \|A\|_{y;x,*}$). Note that the resulting complexity bound is independent of N and is
"nearly independent" of M. Finally, the prox-mapping for Y is given by a closed form expression
and can be computed in linear time:

$$\mathrm{argmin}_y\left\{\sum_{j=1}^N\sum_{i=1}^M [y^j]_i\ln([y^j]_i) + \sum_{j=1}^N\langle \xi^j, y^j\rangle : y^j \ge 0,\ \sum_i [y^j]_i = N^{-1},\ 1 \le j \le N\right\}$$

$$= \left\{y^j : [y^j]_i = \frac{\exp\{-\xi^j_i\}}{N\sum_{s=1}^M \exp\{-\xi^j_s\}},\ 1 \le i \le M\right\}_{j=1}^N.$$
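
The closed form of this prox-mapping is a block-wise "softmax" scaled by 1/N: stationarity of $\sum_i y_i\ln y_i + \langle \xi, y\rangle$ under the constraint $\sum_i y_i = N^{-1}$ gives $y_i \propto \exp\{-\xi_i\}$ within each block. A minimal sketch (function name and data layout are ours) is given below; subtracting the block minimum before exponentiation is a standard safeguard against overflow and does not change the result.

```python
import math


def entropy_prox_blocks(xi):
    """Entropy prox-mapping on Y = {y^j >= 0, sum_i [y^j]_i = 1/N, j = 1..N}.

    xi -- list of N blocks, each a list of M reals (the vectors xi^j)
    Returns the blocks y^j with [y^j]_i = exp(-xi^j_i) / (N * sum_s exp(-xi^j_s)).
    """
    n_blocks = len(xi)
    result = []
    for block in xi:
        lowest = min(block)                               # shift for numerical stability
        weights = [math.exp(-(v - lowest)) for v in block]
        total = sum(weights)
        result.append([w / (n_blocks * total) for w in weights])
    return result


# Usage: two blocks of size two; each output block sums to 1/2.
print(entropy_prox_blocks([[0.0, 0.0], [0.0, math.log(3.0)]]))
```

The cost is one pass over the MN entries, in line with the linear-time claim above.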
A Appendix: Proofs
A.1 Lemma on Prox-mapping

Lemma A.1. Let Y be a nonempty closed and bounded subset of a Euclidean space $E_y$, and
let $\|\cdot\|$, $\omega(\cdot)$ be the corresponding proximal setup. Let, further, U be a closed convex subset of
Y intersecting the relative interior of Y, and let $p \in E_y$.

(i) The optimization problem

$$\min_{y\in U}\ h_p(y) := [\langle p, y\rangle + \omega(y)]$$

has a unique solution $y_*$. This solution is fully characterized by the inclusion $y_* \in U \cap Y^o$,
$Y^o = \{y \in Y : \partial\omega(y) \ne \emptyset\}$, coupled with the relation

$$\langle p + \omega'(y_*), u - y_*\rangle \ge 0 \quad \forall u \in U. \qquad (60)$$

(ii) When U is cut off Y by a system of linear inequalities $e_i(y) \le 0$, $i = 1, \ldots, m$, there exist
Lagrange multipliers $\lambda_i \ge 0$ such that $\lambda_ie_i(y_*) = 0$, $1 \le i \le m$, and

$$\forall u \in Y : \langle p + \omega'(y_*) + \sum_{i=1}^m \lambda_ie'_i, u - y_*\rangle \ge 0. \qquad (61)$$

(iii) In the situation of (ii), assuming $p = \xi - \omega'(y)$ for some $y \in Y^o$, we have

$$\forall u \in Y : \langle \xi, y_* - u\rangle - \sum_i \lambda_ie_i(u) \le V_y(u) - V_{y_*}(u) - V_y(y_*). \qquad (62)$$

Proof. (i): In one direction the statement is evident: if $y_* \in U \cap Y^o$ satisfies (60), then $y_*$ clearly
minimizes $h_p(\cdot)$ on U. Further, the minimizer of $h_p$ on U clearly exists and is unique due to
the strong convexity of $h_p(\cdot)$. Thus, all we need to prove is that the minimizer $y_*$ of $h_p$ over U
belongs to $Y^o$ and satisfies (60). We can assume w.l.o.g. that Y spans the entire $E_y$ and thus
its relative interior is the same as its interior. Let $y \in U \cap \mathrm{int}\,Y$, and let $y_t = y_* + t(y - y_*)$,
so that $y_t \in U \cap \mathrm{int}\,Y \subset Y^o$ for $0 < t \le 1$. Consequently, the function $\phi(t) = h_p(y_t)$ is convex,
continuously differentiable on (0, 1] with the derivative $\phi'(t) = \langle \omega'(y_t) + p, y - y_*\rangle$, and attains
its minimum on [0, 1] at t = 0, whence $\phi'(t) \ge 0$ for $0 < t \le 1$. Now let $r > 0$ be such that the
Euclidean ball B of radius r centered at y is contained in Y. Since $h_p$ is continuous on Y, we
have $h_p(z) - h_p(y_t) \le V < \infty$ for all $z \in B$ and all $t \in [0, 1]$, whence $\langle p + \omega'(y_t), z - y_t\rangle \le V$ for
all $t \in (0, 1]$ and all $z \in B$. On the other hand,

$$V \ge \max_{z\in B}\langle p + \omega'(y_t), z - y_t\rangle = \langle p + \omega'(y_t), y - y_t\rangle + r\|p + \omega'(y_t)\|_2 = (1-t)\phi'(t) + r\|p + \omega'(y_t)\|_2 \ge r\|p + \omega'(y_t)\|_2.$$

We see that $\omega'(y_t)$ remains bounded as $t \to +0$. Thus there exists a sequence $t_i \to +0$, $i \to \infty$,
and $e \in E_y$ such that $\omega'(y_{t_i}) \to e$ as $i \to \infty$. Since $\omega$ is continuous on Y and $y_{t_i} \to y_*$ as
$i \to \infty$, $e \in \partial\omega(y_*)$, that is, $y_* \in Y^o$, whence in fact $\omega'(y_t) \to \omega'(y_*)$, $t \to +0$, and thus
$\langle p + \omega'(y_*), y - y_*\rangle = \lim_{t\to+0}\phi'(t) \ge 0$. Thus, $y_* \in Y^o$ and (60) holds true for all $u \in U \cap \mathrm{int}\,Y$.
The latter set is dense in U, so that (60) holds true for all $u \in U$, as claimed in (i).

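(ii): Let $K = \mathrm{cl}\{d : y_* + td \in Y \text{ for some } t > 0\}$ be the tangential cone of Y at $y_*$, and

$$L = \{d : e_i(y_* + d) \le 0\ \ \forall i \in I = \{i : e_i(y_*) = 0\}\}.$$

Since $U = \{y \in Y : e_i(y) \le 0\}$ intersects $\mathrm{int}\,Y$, the interior of the cone K intersects with L. This
combines with (60) to ensure that the vector $p + \omega'(y_*)$ belongs to the cone dual to $K \cap L$. By
the Dubovitski-Milutin lemma this implies that $p + \omega'(y_*)$ belongs to the arithmetic sum of the
cones dual to K and to L, meaning that for some nonnegative $\lambda_i$, $i \in I$, $p + \sum_{i\in I}\lambda_ie'_i + \omega'(y_*)$
belongs to the cone dual to K, which is exactly what is stated in (ii).

(iii): With I and $\lambda_i$, $i \in I$, as described in the proof of (ii), we have

$$\forall u \in Y : \langle \xi - \omega'(y) + \sum_{i\in I}\lambda_ie'_i + \omega'(y_*), u - y_*\rangle \ge 0 \quad [\text{see } (61)]$$

$$\Rightarrow\ \forall u \in Y : \langle \xi, y_* - u\rangle + \sum_{i\in I}\lambda_i\langle e'_i, y_* - u\rangle \le \langle \omega'(y_*) - \omega'(y), u - y_*\rangle = V_y(u) - V_{y_*}(u) - V_y(y_*),$$

where the concluding equality is readily given by the definition of $V_\cdot(\cdot)$. Since $e_i$ is affine and $\lambda_ie_i(y_*) = 0$, we have
$\lambda_i\langle e'_i, y_* - u\rangle = \lambda_i[e_i(y_*) - e_i(u)] = -\lambda_ie_i(u)$, and (62) follows. □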
Applying item (i) of Lemma I A . 1 1 with Cj = (i.e., with U = Y), we arrive at {ijj'{y^),y — yj) > 

for all y G y, as required in (fTUj) . Applying item (iii) to the case of ej(n) = for all u, we get 
A.2 Proof of Proposition 3.2

1°. Observe that when t is a non-terminal step of the algorithm, the level set $U_t$ is a closed
convex subset of Y which intersects the relative interior of Y; indeed, by (26) and due to $\epsilon_t > 0$
for a non-terminal t, there exists $y \in Y$ such that $h_\tau(y) \ge \epsilon_t > \ell_t$, $\tau \in S_t$, that is, $U_t$ is cut
off Y by a system of constraints satisfying the Slater condition. Denoting $S_t = \{y_{\tau_1}, y_{\tau_2}, \ldots, y_{\tau_k}\}$
and invoking item (iii) of Lemma A.1 with $y = \hat y_t$, $\xi = 0$ and $e_i(\cdot) = \gamma\epsilon_t - h_{\tau_i}(\cdot)$ (so that
$U_t = \{u \in Y : e_i(u) \le 0,\ 1 \le i \le k\}$), we get $y_* = y_{t+1}$ and

$$-\sum_i \lambda_ie_i(u) \le V_{\hat y_t}(u) - V_{y_{t+1}}(u) - V_{\hat y_t}(y_{t+1}) \quad \forall u \in Y$$

with some $\lambda_i \ge 0$. When $u \in U_t$, we have $e_i(u) \le 0$, that is,

$$\forall u \in U_t : V_{y_{t+1}}(u) \le V_{\hat y_t}(u) - V_{\hat y_t}(y_{t+1}). \qquad (63)$$

2°. When t starts a phase, we have $\hat y_t = y_\omega = y_1$, and clearly $1 \in S_t$, whence $h_\tau(\hat y_t) \le 0$ for
some $\tau \in S_t$ (specifically, for $\tau = 1$). When t does not start a phase, we have $\hat y_t = y_t$ and $t \in S_t$,
so that here again $h_\tau(\hat y_t) \le 0$ for some $\tau \in S_t$. On the other hand, $h_\tau(y_{t+1}) \ge \gamma\epsilon_t$ for all $\tau \in S_t$
due to $y_{t+1} \in U_t$. Thus, when passing from $\hat y_t$ to $y_{t+1}$, at least one of $h_\tau(\cdot)$ grows by at least
$\gamma\epsilon_t$. Taking into account that $h_\tau(z) = \langle g(y_\tau), y_\tau - z\rangle$ is Lipschitz continuous with constant $L[g]$
w.r.t. $\|\cdot\|$ (by (15)), we conclude that $\|\hat y_t - y_{t+1}\| \ge \gamma\epsilon_t/L[g]$. With this in mind, (63) combines
with (9) to imply that

$$\forall u \in U_t : V_{y_{t+1}}(u) \le V_{\hat y_t}(u) - \frac{1}{2}\|\hat y_t - y_{t+1}\|^2 \le V_{\hat y_t}(u) - \frac{\gamma^2\epsilon_t^2}{2L^2[g]}. \qquad (64)$$

3°. Let the algorithm perform phase s, let $t_s$ be the first step of this phase, and r be another
step of the phase. We claim that all level sets $U_t$, $t_s \le t \le r$, have a point in common, specifically,
(any) $u \in \mathrm{Argmax}_{y\in Y}\min_{\tau\in S_r} h_\tau(y)$. Indeed, since r belongs to phase s, we have

$$\gamma\Delta_s < \epsilon_r = \max_{y\in Y}\min_{\tau\in S_r} h_\tau(y) = \min_{\tau\in S_r} h_\tau(u)$$
and $\Delta_s = \epsilon_{t_s} = \max_{y\in Y}\min_{\tau\in S_{t_s}} h_\tau(y)$ (see (26) and the definition of $\Delta_s$). Besides this, r
belongs to phase s, and within a phase, sets $S_t$ extend as t grows, so that $S_{t_s} \subset S_t \subset S_r$ when
$t_s \le t \le r$, implying that $\epsilon_{t_s} \ge \epsilon_{t_s+1} \ge \ldots \ge \epsilon_r$. Thus, for $t \in \{t_s, t_s+1, \ldots, r\}$ we have

$$\min_{\tau\in S_t} h_\tau(u) \ge \min_{\tau\in S_r} h_\tau(u) > \gamma\Delta_s = \gamma\epsilon_{t_s} \ge \gamma\epsilon_t,$$

implying that $u \in U_t$.

With the just defined u, let us look at the quantities vt := Vg^{u), ts < t < r. We have 
vt^ < 1^^^ due to yt^ = Uu) and ([12]), and 

when ts < t < r (due to (|64p combined with yt = yt when ts < t < r). We conclude that 
(r — ts)7^A^ < ri^L^[(7]. Thus, the number Ts of steps of phase s admits the bound 

< (65) 

4°. Assume that MDL does not terminate in course of first $T \geq 1$ steps, and let $s$ be the
index of the phase to which the step $T$ belongs. Then $\Delta_s > \epsilon$ (otherwise we would terminate not
later than at the first step of phase $s$); and besides this, by construction, $\Delta_{s+1} \leq \gamma\Delta_s$ whenever
phase $s+1$ takes place. Therefore



A.3 Proof of Proposition 3.3

Observe that the algorithm can terminate only in the case A of B.2.2.3, and in this case the 
output is indeed as claimed in the Proposition. Thus, all we need to prove is the upper bound (37)
on the number of steps before termination. 

1°. Let us bound from above the number of steps at an arbitrary phase $s$. Assume that
phase $s$ did not terminate in course of the first $T$ steps, so that $u_1, \ldots, u_T$ are well defined. We
claim that then

$$\|u_\tau - u_{\tau+1}\| \geq \ell_s/L[g], \quad 1 \leq \tau < T. \qquad (66)$$

Indeed, by construction $h_\tau(y) := \langle g(u_\tau), u_\tau - y\rangle$ is $\geq \ell_s = \gamma f_s$ when $y = u_{\tau+1}$ (due to
$u_{\tau+1} \in Y_\tau$). Since $\|g(u)\|_* \leq L[g]$ for all $u \in Y$, (66) follows.

Now let us look at what happens with the quantities $\omega(u_\tau)$ as $\tau$ grows. By strong convexity
of $\omega$ we have

$$\omega(u_{\tau+1}) - \omega(u_\tau) \geq \langle \omega'(u_\tau), u_{\tau+1} - u_\tau\rangle + \frac{1}{2}\|u_\tau - u_{\tau+1}\|^2.$$

The first term in the right hand side is $\geq 0$, since $u_\tau$ is the minimizer of $\omega(\cdot)$ over $Y_{\tau-1}$,
while $u_{\tau+1} \in Y_\tau \subset Y_{\tau-1}$. The second term in the right hand side is $\geq \frac{\ell_s^2}{2L^2[g]}$ by (66). Since
$\omega(u_{\tau+1}) - \omega(u_\tau) \geq \frac{\ell_s^2}{2L^2[g]}$, we get $\omega(u_T) - \omega(u_1) \geq (T-1)\frac{\ell_s^2}{2L^2[g]} = (T-1)\frac{\gamma^2 f_s^2}{2L^2[g]}$. Recalling the
definition of $\Omega$, the left hand side in this inequality is $\leq \frac{1}{2}\Omega^2$. It follows that whenever phase
$s$ does not terminate in course of the first $T$ steps, one has $T \leq \frac{\Omega^2L^2[g]}{\gamma^2 f_s^2} + 1$, that is, the total
number of steps at phase $s$, provided this phase exists, is at most $T_s = \frac{\Omega^2L^2[g]}{\gamma^2 f_s^2} + 2$. Now, we have

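$$f_s \leq f_1 = \max_{y\in Y}\langle g(y_\omega), y_\omega - y\rangle \leq L[g]\max_{y\in Y}\|y - y_\omega\| \leq \Omega L[g]$$

(recall that $\|g(y)\|_* \leq L[g]$ and see (12)). Thus

$$T_s \leq \frac{\Omega^2L^2[g]}{\gamma^2 f_s^2} + 2 \leq \frac{(1+2\gamma^2)\,\Omega^2L^2[g]}{\gamma^2 f_s^2}$$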

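for all $s$ such that $s$-th phase exists. By construction, we have $f_s > \epsilon$ and $f_s \leq (\gamma + (1-\gamma)\theta)f_{s-1}$,
whence the method eventually terminates (since $\gamma + (1-\gamma)\theta < 1$). Assuming that the termination
happens at phase $s$, the total number of steps is bounded by

$$\sum_{r=1}^{s}\frac{(1+2\gamma^2)\,\Omega^2L^2[g]}{\gamma^2 f_r^2} \leq \sum_{r=1}^{s}\frac{(1+2\gamma^2)\,\Omega^2L^2[g]\,(\gamma+(1-\gamma)\theta)^{2(s-r)}}{\gamma^2 f_s^2}$$

$$\leq \sum_{r=1}^{s}\frac{(1+2\gamma^2)\,\Omega^2L^2[g]\,(\gamma+(1-\gamma)\theta)^{2(s-r)}}{\gamma^2\epsilon^2} \leq \frac{(1+2\gamma^2)\,\Omega^2L^2[g]}{\gamma^2\left[1-(\gamma+(1-\gamma)\theta)^2\right]\epsilon^2},$$

as claimed. □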
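The counting argument above is a geometric-series bound: each phase costs at most a constant over $f_r^2$, and the gaps $f_r$ shrink at least by a fixed factor from phase to phase. The sketch below replays it numerically in the slowest admissible scenario. It assumes the contraction factor is $q = \gamma + (1-\gamma)\theta$ and the initial gap is $\Omega L[g]$, as read from the proof; all function and parameter names are hypothetical.

```python
def phase_step_bound(f_s, gamma, omega_radius, lipschitz):
    """Per-phase bound (1 + 2*gamma^2) * Omega^2 * L^2 / (gamma^2 * f_s^2)."""
    return (1 + 2 * gamma**2) * omega_radius**2 * lipschitz**2 / (gamma**2 * f_s**2)

def total_step_bound(eps, gamma, theta, omega_radius, lipschitz):
    """Closed-form bound on the total number of steps over all phases."""
    q = gamma + (1 - gamma) * theta  # assumed per-phase contraction factor
    return ((1 + 2 * gamma**2) * omega_radius**2 * lipschitz**2
            / (gamma**2 * (1 - q**2) * eps**2))

def slowest_gap_sequence(eps, gamma, theta, omega_radius, lipschitz):
    """Gaps f_1, f_2, ... starting at Omega*L and shrinking by exactly q while f > eps."""
    q = gamma + (1 - gamma) * theta
    gaps = []
    f = omega_radius * lipschitz
    while f > eps:
        gaps.append(f)
        f *= q
    return gaps

# Illustrative parameter values, not taken from the paper.
params = dict(gamma=0.5, theta=0.5, omega_radius=1.0, lipschitz=1.0)
gaps = slowest_gap_sequence(eps=0.01, **params)
summed = sum(phase_step_bound(f, params["gamma"], params["omega_radius"],
                              params["lipschitz"]) for f in gaps)
total = total_step_bound(eps=0.01, **params)
print(len(gaps), summed <= total)
```

Since every gap in the sequence exceeds $\epsilon$ and consecutive gaps differ by the factor $q$, the summed per-phase bounds stay below the closed-form total, which is the chain of inequalities used in the proof.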



A.4 Verifying (41)

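For a fixed $\beta > 0$, let $y(x) \in \operatorname{argmin}_{y\in Y}\left[\langle x, Ay + a\rangle + \psi(y) + \beta\omega(y)\right]$, so that $(f_*^\beta)'(x) = Ay(x) + a$.
Let $x, x' \in X$. Taking into account that $\psi$ is Lipschitz continuous, by an argument completely similar
to the one in the proof of Lemma A.1 we have $y(x) \in Y^o$, $y(x') \in Y^o$ and

$$\langle A^*x,\ y(x') - y(x)\rangle + \langle \psi'(y(x)),\ y(x') - y(x)\rangle + \beta\langle \omega'(y(x)),\ y(x') - y(x)\rangle \geq 0,$$
$$\langle A^*x',\ y(x) - y(x')\rangle + \langle \psi'(y(x')),\ y(x) - y(x')\rangle + \beta\langle \omega'(y(x')),\ y(x) - y(x')\rangle \geq 0$$

with properly selected $\psi'(y(x)) \in \partial\psi(y(x))$, $\psi'(y(x')) \in \partial\psi(y(x'))$. It follows that

$$\langle x - x',\ A(y(x') - y(x))\rangle \geq \langle \psi'(y(x')) - \psi'(y(x)),\ y(x') - y(x)\rangle + \beta\langle \omega'(y(x')) - \omega'(y(x)),\ y(x') - y(x)\rangle \geq \beta\|y(x') - y(x)\|_y^2,$$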

and therefore 

$$\|x - x'\|_x\,\|A\|_{y;x,*}\,\|y(x) - y(x')\|_y \geq \|x - x'\|_x\,\|Ay(x) - Ay(x')\|_{x,*} \geq \langle x - x',\ A(y(x') - y(x))\rangle \geq \beta\|y(x') - y(x)\|_y^2.$$

The latter implies that

$$\|y(x) - y(x')\|_y \leq \beta^{-1}\|A\|_{y;x,*}\,\|x - x'\|_x,$$

so that

$$\|(f_*^\beta)'(x) - (f_*^\beta)'(x')\|_{x,*} = \|A(y(x) - y(x'))\|_{x,*} \leq \beta^{-1}\|A\|_{y;x,*}^2\,\|x - x'\|_x,$$

which is (41). □
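The key intermediate fact in this verification is that the minimizer $y(x)$ depends on $x$ in a Lipschitz way, with constant of order $\|A\|/\beta$. The sketch below checks this on a toy Euclidean instance. Everything specific in it is an assumption made for illustration: $Y$ is the box $[-1,1]^n$, $\psi \equiv 0$, $\omega(y) = \frac{1}{2}\|y\|_2^2$, and the operator norm of $A$ is replaced by the (larger) Frobenius norm to keep the code dependency-free.

```python
import math
import random

def prox_point(x, A, beta, lo=-1.0, hi=1.0):
    """y(x) = argmin over the box [lo, hi]^n of <x, A y> + (beta/2)*||y||^2.

    The objective is separable in y, so the minimizer is clip(-A^T x / beta).
    """
    m, n = len(A), len(A[0])
    return [min(hi, max(lo, -sum(A[i][j] * x[i] for i in range(m)) / beta))
            for j in range(n)]

def norm2(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(t * t for t in v))

def frobenius(A):
    """Frobenius norm, an upper bound on the spectral norm of A."""
    return math.sqrt(sum(a * a for row in A for a in row))

def worst_violation(trials=500, m=4, n=6, beta=0.3, seed=0):
    """Largest observed value of ||y(x) - y(x')|| - ||A||_F * ||x - x'|| / beta."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
        x = [rng.uniform(-1, 1) for _ in range(m)]
        xp = [rng.uniform(-1, 1) for _ in range(m)]
        dy = norm2([a - b for a, b in zip(prox_point(x, A, beta),
                                          prox_point(xp, A, beta))])
        bound = frobenius(A) * norm2([a - b for a, b in zip(x, xp)]) / beta
        worst = max(worst, dy - bound)
    return worst

violation = worst_violation()
print(violation <= 1e-12)
```

Using the Frobenius norm makes the test weaker than the bound proved above, so a pass here is only a consistency check, not a verification of the sharp constant.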